Dataset:

nthngdy/oscar-mini

Multilinguality:

multilingual

Language Creators:

found

Annotations Creators:

no-annotation

Source Datasets:

oscar

ArXiv:

arxiv:2010.14571

License:

cc0-1.0

WARNING: This dataset is an extraction of the OSCAR dataset. It is intended to simulate the use of the full dataset in low-resource settings and to debug codebases that will eventually use the original OSCAR dataset.

Using this dataset is legally equivalent to using a processed version of OSCAR. I take no credit for the gathering of the original data and therefore refer entirely to the original dataset card below.
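For reference, here is a minimal sketch of loading this extraction with the Hugging Face datasets library. The configuration name "unshuffled_deduplicated_en" is only assumed to mirror OSCAR's naming; check the configurations actually available on the Hub first.

from datasets import load_dataset  # pip install datasets

# Hypothetical config name, assumed to follow OSCAR's
# "unshuffled_deduplicated_<lang>" scheme.
ds = load_dataset("nthngdy/oscar-mini", "unshuffled_deduplicated_en", split="train")

# OSCAR examples expose an "id" and a "text" field.
print(ds[0]["text"][:200])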

Dataset Card for "oscar"

Dataset Summary

OSCAR (Open Super-large Crawled ALMAnaCH coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. The data is distributed by language, in both original and deduplicated form.

Supported Tasks and Leaderboards

OSCAR is mainly intended for pre-training language models and word representations.

Languages

All the data is distributed by language, and both the original and the deduplicated versions are available. 166 different languages are covered. The "Data Splits Sample Size" section provides, for each sub-corpus, its language code as well as the number of words (space-separated tokens), the number of lines, and the size of both the original and the deduplicated versions of OSCAR.
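To see which per-language configurations this extraction actually exposes (the full OSCAR corpus covers 166 languages, but this mini version may expose fewer), one possible check with the datasets library is sketched below.

from datasets import get_dataset_config_names

# List the configurations this extraction exposes on the Hub.
configs = get_dataset_config_names("nthngdy/oscar-mini")
print(len(configs), "configurations")
print(configs[:10])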

Dataset Structure

We show detailed information for all the configurations of the dataset.

Dataset Creation

Curation Rationale

OSCAR was constructed using goclassy, a new pipeline derived from fastText's. Goclassy reuses the fastText linear classifier and the pre-trained fastText model for language recognition, but it completely rewrites and parallelises the pipeline in an asynchronous fashion.
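As an illustration of the language-recognition step, the following sketch uses fastText's published pre-trained language-identification model (lid.176.bin, downloadable from https://fasttext.cc); the exact model file and invocation used by goclassy may differ.

import fasttext  # pip install fasttext

# Load fastText's pre-trained 176-language identification model
# (assumed to be present in the working directory).
model = fasttext.load_model("lid.176.bin")

# Predict the language label and its confidence for a single line of text.
labels, scores = model.predict("Ceci est une phrase en français.")
print(labels[0], scores[0])  # e.g. "__label__fr" and its probability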

The order of operations is more or less the same as in the fastText pre-processing pipeline, but instead of clustering multiple operations into a single blocking process, a worker is launched for each operation, with the number of possible parallel operations at a given time bounded by the number of available threads rather than the number of CPUs. Goclassy is implemented in the Go programming language, so it lets the Go runtime handle the scheduling of the processes. As a result, goclassy's pipeline does not have to wait for a whole WET file to be downloaded, decompressed, and classified before starting on the next one: a new file begins downloading and being processed as soon as the scheduler can allocate a new process.
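The sketch below only illustrates this bounded-concurrency idea in Python; goclassy itself is written in Go and relies on the Go runtime scheduler, and process_wet_file here is a placeholder rather than a real pipeline step.

import os
from concurrent.futures import ThreadPoolExecutor

def process_wet_file(url: str) -> str:
    # Placeholder for: download and decompress one WET file, classify each of
    # its lines with the fastText model, and append them to per-language shards.
    return f"processed {url}"

wet_paths = [f"wet-file-{i}.warc.wet.gz" for i in range(10)]  # dummy file list
max_parallel = os.cpu_count() or 4  # stand-in for the number of available threads

# One worker per WET file, but never more than max_parallel running at once,
# so new downloads can start while earlier files are still being processed.
with ThreadPoolExecutor(max_workers=max_parallel) as pool:
    for result in pool.map(process_wet_file, wet_paths):
        print(result)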

Filtering and cleaning are applied at the line level before each line is fed to the classifier. Lines shorter than 100 UTF-8 characters and lines containing invalid UTF-8 characters are discarded and are not classified. After all files are processed, the deduplicated versions are constructed, and everything is then split into shards and compressed.
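The line-level rule described above can be expressed directly; this is a minimal sketch, not the actual goclassy code.

def keep_line(raw: bytes) -> bool:
    # Keep a line only if it decodes as valid UTF-8 and is at least
    # 100 characters long; everything else is discarded before classification.
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return len(text) >= 100

print(keep_line("too short".encode("utf-8")))   # False
print(keep_line(("a" * 120).encode("utf-8")))   # True
print(keep_line(b"\xff\xfe not valid utf-8"))   # False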

Source Data

Initial Data Collection and Normalization

Common Crawl is a non-profit foundation that produces and maintains an open repository of web-crawled data that is both accessible and analysable. Common Crawl's complete web archive consists of petabytes of data collected over 8 years of web crawling. The repository contains raw web page HTML data (WARC files), metadata extracts (WAT files), and plain-text extracts (WET files). The organisation's crawlers have always respected nofollow and robots.txt policies.

Each monthly Common Crawl snapshot is in itself a massive multilingual corpus, where every single file contains data coming from multiple web pages written in a large variety of languages and covering all possible types of topics.

To construct OSCAR, the WET files of Common Crawl were used. These contain the plain text extracted from the crawled websites, mostly converted to UTF-8, as well as headers containing the metadata of each crawled document. Each WET file comes compressed in gzip format and is stored on Amazon Web Services. For OSCAR, the November 2018 snapshot was used. It exceeds 20 TB of uncompressed data and contains more than 50 thousand plain-text files, where each file consists of the plain text from multiple websites along with its metadata header.
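For reference, the WET file listing of a Common Crawl snapshot can be fetched as sketched below; the identifier CC-MAIN-2018-47 is assumed here to correspond to the November 2018 crawl, and the URL layout follows Common Crawl's usual naming scheme.

import gzip
import io
import urllib.request

# Assumed location of the WET path listing for the November 2018 snapshot.
PATHS_URL = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2018-47/wet.paths.gz"

with urllib.request.urlopen(PATHS_URL) as resp:
    with gzip.open(io.BytesIO(resp.read()), "rt") as listing:
        wet_paths = [line.strip() for line in listing]

print(len(wet_paths), "WET files in the snapshot")
print(wet_paths[0])  # relative path of the first WET file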

Who are the source language producers?

The data comes from multiple web pages in a large variety of languages.

Annotations

The dataset does not contain any additional annotations.

Annotation process

N/A

Who are the annotators?

N/A

Personal and Sensitive Information

Being constructed from Common Crawl, the corpus may contain personal and sensitive information. This must be taken into account when training deep learning models with OSCAR, especially in the case of text-generation models.

Considerations for Using the Data

Social Impact of Dataset

OSCAR is intended to bring more data to a wide variety of languages; the aim of the corpus is to make large amounts of data available to lower-resource languages in order to facilitate the pre-training of state-of-the-art language modelling architectures.

Discussion of Biases

OSCAR is not properly filtered yet, and this can be reflected in models trained with it. Special care is advised concerning the biases of the resulting models.

Other Known Limitations

The fastText linear classifier is limited both in performance and in the variety of languages it can recognise, so the quality of some OSCAR sub-corpora may be lower than expected, especially for the lowest-resource languages. Some audits have already been carried out by third parties.

Additional Information

Dataset Curators

The corpus was put together by Pedro J. Ortiz, Benoît Sagot, and Laurent Romary, during work done at Inria, in particular within the ALMAnaCH team.

Licensing Information

These data are released under this licensing scheme
We do not own any of the text from which these data has been extracted.
We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/
To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR
This work is published from: France.

Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
* Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
* Clearly identify the copyrighted work claimed to be infringed.
* Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

Citation Information

@inproceedings{ortiz-suarez-etal-2020-monolingual,
    title = "A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages",
    author = "Ortiz Su{'a}rez, Pedro Javier  and
      Romary, Laurent  and
      Sagot, Benoit",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.156",
    pages = "1703--1714",
    abstract = "We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitutes the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.",
}

@inproceedings{OrtizSuarezSagotRomary2019,
  author    = {Pedro Javier {Ortiz Su{\'a}rez} and Benoit Sagot and Laurent Romary},
  title     = {Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures},
  series = {Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019},
  editor    = {Piotr Bański and Adrien Barbaresi and Hanno Biber and Evelyn Breiteneder and Simon Clematide and Marc Kupietz and Harald L{\"u}ngen and Caroline Iliadi},
  publisher = {Leibniz-Institut f{\"u}r Deutsche Sprache},
  address   = {Mannheim},
  doi       = {10.14618/ids-pub-9021},
  url       = {http://nbn-resolving.de/urn:nbn:de:bsz:mh39-90215},
  pages     = {9 -- 16},
  year      = {2019},
  abstract  = {Common Crawl is a considerably large, heterogeneous multilingual corpus comprised of crawled documents from the internet, surpassing 20TB of data and distributed as a set of more than 50 thousand plain text files where each contains many documents written in a wide variety of languages. Even though each document has a metadata block associated to it, this data lacks any information about the language in which each document is written, making it extremely difficult to use Common Crawl for monolingual applications. We propose a general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language; we specifically design it so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint. We develop the pipeline so that it can be easily reapplied to any kind of heterogeneous corpus and so that it can be parameterised to a wide range of infrastructures. We also distribute a 6.3TB version of Common Crawl, filtered, classified by language, shuffled at line level in order to avoid copyright issues, and ready to be used for NLP applications.},
  language  = {en}
}

Contributions

Thanks to @pjox and @lhoestq for adding this dataset.