Dataset:

codeparrot/codeparrot-clean

CodeParrot 🦜 Dataset Cleaned

What is it?

A dataset of Python files from GitHub. This is the deduplicated version of the codeparrot dataset.
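Assuming the Hugging Face datasets library is installed, the dataset can be loaded by its Hub identifier as in the sketch below (note that loading the train split without streaming downloads the full ~50GB described under "Structure"):

from datasets import load_dataset

# Load the train split by its Hub identifier (this downloads the full dataset).
ds = load_dataset("codeparrot/codeparrot-clean", split="train")
print(ds[0]["repo_name"], ds[0]["path"])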

Processing

The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps (a sketch of these filters is shown below):

  • Deduplication
    • Remove exact matches
  • Filtering
    • Average line length < 100
    • Maximum line length < 1000
    • Alphanumeric character fraction > 0.25
    • Remove auto-generated files (keyword search)

For more details, see the preprocessing script in the transformers repository.
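As a rough illustration, the criteria above can be expressed as a single filter function. This is a minimal sketch, not the actual preprocessing script: the hash function, keyword list, and file-head heuristic are assumptions made for the example.

import hashlib

# Thresholds mirroring the criteria listed above.
MAX_MEAN_LINE_LEN = 100
MAX_LINE_LEN = 1000
MIN_ALPHA_FRAC = 0.25
# Illustrative keyword list; the real script may use different markers.
AUTOGEN_KEYWORDS = ["auto-generated", "autogenerated", "automatically generated"]

def is_clean(content: str, seen_hashes: set) -> bool:
    """Return True if a file passes the deduplication and filtering rules."""
    # Deduplication: drop exact matches by content hash.
    digest = hashlib.md5(content.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)

    lines = content.splitlines()
    if not lines:
        return False

    # Filtering: line-length statistics.
    line_lengths = [len(line) for line in lines]
    if sum(line_lengths) / len(line_lengths) >= MAX_MEAN_LINE_LEN:
        return False
    if max(line_lengths) >= MAX_LINE_LEN:
        return False

    # Filtering: fraction of alphanumeric characters.
    alpha_frac = sum(c.isalnum() for c in content) / max(len(content), 1)
    if alpha_frac <= MIN_ALPHA_FRAC:
        return False

    # Filtering: keyword search for auto-generated files (checks the file head here).
    head = "\n".join(lines[:10]).lower()
    if any(keyword in head for keyword in AUTOGEN_KEYWORDS):
        return False

    return True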

Splits

The dataset is split into a train and a validation split, used for training and evaluation respectively.

Structure

This dataset contains ~50GB of code in 5,361,373 files.

DatasetDict({
    train: Dataset({
        features: ['repo_name', 'path', 'copies', 'size', 'content', 'license', 'hash', 'line_mean', 'line_max', 'alpha_frac', 'autogenerated'],
        num_rows: 5361373
    })
})
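Since the full dataset is around 50GB, streaming is a convenient way to inspect a few records without downloading everything; the sketch below assumes the standard streaming mode of the datasets library.

from datasets import load_dataset

# Stream records instead of downloading ~50GB to disk.
streamed = load_dataset("codeparrot/codeparrot-clean", split="train", streaming=True)
for example in streamed.take(3):
    # Each record carries the file content plus the metadata fields listed above.
    print(example["repo_name"], example["path"], example["size"])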