Dataset:

codeparrot/github-code-clean

License:

apache-2.0

This is a cleaner version of the GitHub-code dataset. We apply the following filters:

  • Average line length < 100
  • Alpha numeric characters fraction > 0.25
  • Remove auto-generated files (keyword search)

These filters remove 3.39M files, which make up 2.94% of the dataset.
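
The three filters listed above could be approximated per file with a sketch like the one below. This is not the dataset authors' code: the function names, the keyword list, and the choice to scan only the first few lines for auto-generation markers are all assumptions made for illustration. Only the two thresholds (100 and 0.25) come from the list above.

```python
# Hypothetical keyword list; the source does not say which keywords were searched for.
AUTOGEN_KEYWORDS = (
    "auto-generated",
    "autogenerated",
    "automatically generated",
    "do not edit",
)


def average_line_length(text: str) -> float:
    """Mean number of characters per line (0.0 for an empty file)."""
    lines = text.splitlines()
    if not lines:
        return 0.0
    return sum(len(line) for line in lines) / len(lines)


def alphanumeric_fraction(text: str) -> float:
    """Share of characters that are letters or digits (0.0 for an empty file)."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def looks_auto_generated(text: str, scan_lines: int = 5) -> bool:
    """Case-insensitive keyword search over the first `scan_lines` lines.

    Limiting the search to the file header is an assumption of this sketch.
    """
    head = "\n".join(text.splitlines()[:scan_lines]).lower()
    return any(keyword in head for keyword in AUTOGEN_KEYWORDS)


def keep_file(text: str) -> bool:
    """Return True if the file passes all three filters."""
    return (
        average_line_length(text) < 100
        and alphanumeric_fraction(text) > 0.25
        and not looks_auto_generated(text)
    )
```

Under these assumptions, a short hand-written function passes `keep_file`, while a single 500-character line, a file made only of punctuation, or a file whose header says "DO NOT EDIT" is dropped.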