Dataset:

allenai/c4

This is the processed version of Google's C4 dataset.

We prepared five variants of the data: en, en.noclean, en.noblocklist, realnewslike, and multilingual.

For reference, these are the sizes of the variants:

  • en: 305GB
  • en.noclean: 2.3TB
  • en.noblocklist: 380GB
  • realnewslike: 15GB
  • multilingual: 9.7TB
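As a quick sanity check, the sizes listed above can be added up to estimate the size of a full download. This is only a sketch: it assumes decimal units (1 TB = 1000 GB), which the list does not state, and the listed sizes are themselves rounded.

```python
# Sizes of the five variants in GB, copied from the list above.
# Assumption: decimal units, i.e. 1 TB = 1000 GB.
SIZES_GB = {
    "en": 305,
    "en.noclean": 2300,
    "en.noblocklist": 380,
    "realnewslike": 15,
    "multilingual": 9700,
}


def total_tb(sizes_gb):
    """Return the combined size of all variants in TB."""
    return sum(sizes_gb.values()) / 1000


total = total_tb(SIZES_GB)
print(f"Total: {total:.1f} TB")
```

Under these assumptions the total comes to about 12.7 TB, so plan for roughly 13 TB of free disk space if you want everything.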

The en.noblocklist variant is exactly the same as the en variant, except that we turned off the so-called "badwords filter", which removes every document that contains a word from the lists at https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words.
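To make the effect of such a filter concrete, here is a minimal sketch of document-level blocklist filtering. It is not the code that was used to build C4: the matching rules (lowercasing, word boundaries) are assumptions, and the `blocklist` entries are harmless placeholders rather than words from the real lists.

```python
import re


def build_blocklist_pattern(words):
    """Compile one regex that matches any blocklisted word or phrase.

    Assumption: matching is case-insensitive and respects word boundaries.
    """
    escaped = sorted((re.escape(w.lower()) for w in words), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b")


def passes_blocklist(text, pattern):
    """Return True if the document contains no blocklisted word."""
    return pattern.search(text.lower()) is None


# Placeholder entries, purely for illustration.
blocklist = ["foo", "bar baz"]
pattern = build_blocklist_pattern(blocklist)

docs = [
    "A clean sentence.",
    "This mentions Foo once.",
    "Nothing about barbaz here.",
]
# A single match is enough to drop the whole document.
kept = [d for d in docs if passes_blocklist(d, pattern)]
print(kept)
```

Because one match discards the entire document, a filter of this kind also removes pages that use a listed word in a harmless context. That is the reason en.noblocklist is provided as a separate variant.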

How do I download this?

Unfortunately we ran out of time making this into a proper Huggingface dataset, accessible through the datasets Python package. Until we get that ready, please use git to do the download. First, make sure you have Git Large File Storage installed. Once that is done, downloading the whole dataset, all five variants, is easy:

git clone https://huggingface.co/datasets/allenai/c4

This will download 13TB to your local drive. If you want to be more selective about what you download, use these commands instead:

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "en/*"

With GIT_LFS_SKIP_SMUDGE=1 set, the git clone command downloads only the small stub files that Git LFS uses, so you can see all the filenames that exist. You can then convert the stubs into their real files with git lfs pull --include "...". For example, if you wanted all the Dutch documents from the multilingual set, you would run:

git lfs pull --include "multilingual/c4-nl.*.json.gz"
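Once the real files are on disk, they can be read without unpacking them first. The sketch below assumes that each .json.gz file holds one JSON object per line and that each object has a "text" field. Neither detail is stated in this README, so check a file before relying on it. The helper name `iter_documents` and the path in the usage comment are made up for illustration.

```python
import gzip
import json


def iter_documents(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Example usage (hypothetical path; substitute a file you actually pulled):
#
#   for doc in iter_documents("c4/multilingual/<some-file>.json.gz"):
#       print(doc["text"][:80])  # "text" is an assumed field name
#       break
```

Reading line by line keeps memory use flat no matter how large the file is, which matters at the sizes listed above.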

Acknowledgements

Big ups to the good folks at Common Crawl whose data made this possible (consider donating!), to Google for creating the code that curates and filters the data, and to Huggingface, who had no issue with hosting these 13TB of data for public download!

License

We are releasing this dataset under the terms of ODC-BY. By using this, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.