Dataset:
cc100
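This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for more than 100 languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository, by processing the January-December 2018 Commoncrawl snapshots.
CC-100 is mainly intended for pretraining language models and word representations.
To load a language that is not among the predefined configurations, you only need to specify its language code in the config. Valid language codes are listed on the homepage given in the dataset description: https://data.statmt.org/cc-100/ . For example: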
from datasets import load_dataset
dataset = load_dataset("cc100", lang="en")
An example from the am configuration:
{'id': '0', 'text': 'ተለዋዋጭ የግድግዳ አንግል ሙቅ አንቀሳቅሷል ቲ-አሞሌ አጥቅሼ ...\n'}
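Each data point is a paragraph of text. Paragraphs are presented in their original (unshuffled) order. Documents are separated by a single newline character.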
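Because paragraphs arrive in order and documents are delimited by a newline, consecutive records can be regrouped into documents. The following is a minimal sketch, not part of the dataset's tooling: the helper name group_into_documents is hypothetical, and it assumes the separator shows up as its own record whose text is exactly a single newline character. It works on any iterable of {'id', 'text'} records, so it needs no download to try out.

```python
from typing import Dict, Iterable, Iterator, List


def group_into_documents(records: Iterable[Dict[str, str]]) -> Iterator[List[str]]:
    """Yield one list of paragraphs per document (hypothetical helper)."""
    paragraphs: List[str] = []
    for record in records:
        text = record["text"]
        # Assumption: a record containing only a newline marks a document boundary.
        if text == "\n":
            if paragraphs:
                yield paragraphs
                paragraphs = []
        else:
            # Drop the trailing newline that each paragraph carries.
            paragraphs.append(text.rstrip("\n"))
    # Emit the last document if the stream did not end with a separator.
    if paragraphs:
        yield paragraphs


# Made-up records in the same shape as the am example above.
sample = [
    {"id": "0", "text": "First paragraph of document A.\n"},
    {"id": "1", "text": "Second paragraph of document A.\n"},
    {"id": "2", "text": "\n"},
    {"id": "3", "text": "Only paragraph of document B.\n"},
]
documents = list(group_into_documents(sample))
print(documents)
```

Since the helper is a generator, it can also be applied to a streamed split without holding the whole corpus in memory.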
The data fields are id (the identifier of the example) and text (the paragraph content as a string).
Sizes of some configurations:
name | train |
---|---|
am | 3124561 |
sr | 35747957 |
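[More Information Needed]
[More Information Needed]
Initial Data Collection and Normalization
[More Information Needed]
Who are the source language producers? The data comes from multiple web pages in a large variety of languages.
The dataset does not contain any additional annotations.
Annotation process: [N/A]
Who are the annotators? [N/A]
Because it is constructed from Common Crawl, the dataset may contain personal and sensitive information. This must be considered before training deep learning models on CC-100, especially in the case of text-generation models.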
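One way to act on this before training is to mask obvious identifiers in each paragraph. The sketch below is only an illustration under stated assumptions: the helper name mask_emails, the placeholder token, and the regular expression are all hypothetical choices, the pattern is deliberately simple, and it covers email addresses only. It is not a complete filter for personal or sensitive information.

```python
import re

# Assumption: a deliberately simple email pattern. It will miss some valid
# addresses and does nothing about names, phone numbers, or other identifiers.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text: str, placeholder: str = "<EMAIL>") -> str:
    """Replace every substring that looks like an email address (hypothetical helper)."""
    return EMAIL_RE.sub(placeholder, text)


masked = mask_emails("Contact jane.doe@example.com for details.\n")
print(masked)
```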
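[More Information Needed]
[More Information Needed]
[More Information Needed]
This dataset was prepared by Statistical Machine Translation at the University of Edinburgh using the CC-Net toolkit by Facebook Research.
Statistical Machine Translation at the University of Edinburgh makes no claims of intellectual property on the original corpus. By using this dataset, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.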
@inproceedings{conneau-etal-2020-unsupervised,
    title = "Unsupervised Cross-lingual Representation Learning at Scale",
    author = "Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.747",
    doi = "10.18653/v1/2020.acl-main.747",
    pages = "8440--8451",
    abstract = "This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6{\%} average accuracy on XNLI, +13{\%} average F1 score on MLQA, and +2.4{\%} F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7{\%} in XNLI accuracy for Swahili and 11.4{\%} for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code and models publicly available.",
}
@inproceedings{wenzek-etal-2020-ccnet,
    title = "{CCN}et: Extracting High Quality Monolingual Datasets from Web Crawl Data",
    author = "Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis and Chaudhary, Vishrav and Guzm{\'a}n, Francisco and Joulin, Armand and Grave, Edouard",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.494",
    pages = "4003--4012",
    abstract = "Pre-training text representations have led to significant improvements in many areas of natural language processing. The quality of these models benefits greatly from the size of the pretraining corpora as long as its quality is preserved. In this paper, we describe an automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages. Our pipeline follows the data processing introduced in fastText (Mikolov et al., 2017; Grave et al., 2018), that deduplicates documents and identifies their language. We augment this pipeline with a filtering step to select documents that are close to high quality corpora like Wikipedia.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
Thanks to @abhishekkrthakur for adding this dataset.