Dataset:

nthngdy/oscar-small

Multilinguality:

multilingual

Language Creators:

found

Annotation Creators:

no-annotation

Source Datasets:

oscar

ArXiv:

arxiv:2010.14571

License:

cc0-1.0

Languages:

English

Warning: this dataset is an extract of the OSCAR dataset, intended to simulate the use of the full dataset in low-resource settings.

Legally speaking, using this dataset is equivalent to using a processed version of OSCAR. I take no credit for the gathering of the original data and therefore refer entirely to the original dataset.
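
As a quick way to get started, the extract can be loaded with the Hugging Face `datasets` library. The snippet below is only a minimal sketch; the configuration name `unshuffled_deduplicated_en` and the single `train` split are assumptions based on the layout of the original `oscar` dataset.

```python
# Minimal sketch for loading one configuration of this extract with the
# `datasets` library. The config name and split are assumed to mirror the
# original "oscar" dataset (e.g. "unshuffled_deduplicated_en", split "train").
from datasets import load_dataset

dataset = load_dataset("nthngdy/oscar-small", "unshuffled_deduplicated_en", split="train")

print(dataset)                    # number of rows and column names
print(dataset[0]["text"][:200])   # beginning of the first document
```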

Dataset card for "oscar"

Dataset Summary

OSCAR, or Open Super-large Crawled ALMAnaCH coRpus, is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. The data is distributed by language, in both original and deduplicated form.

Supported Tasks and Leaderboards

OSCAR is mainly intended for pretraining language models and word representations.

Languages

All the data is distributed by language, with both the original and the deduplicated versions available along with their counts and sizes. 166 different languages are available. The table in the "Data Splits Sample Size" subsection gives, for each sub-corpus, the language code as well as the number of words (space-separated tokens), the number of lines and the size for both the original and the deduplicated versions.
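
Note that the counts above describe the original OSCAR release; to check which per-language configurations this extract actually exposes, one option is to query the Hub. A small sketch, assuming the standard `datasets` API:

```python
# Sketch: list the per-language configurations available for this extract.
from datasets import get_dataset_config_names

configs = get_dataset_config_names("nthngdy/oscar-small")
print(len(configs))   # number of available configurations
print(configs[:5])    # e.g. original/deduplicated variants per language code
```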

Dataset Structure

We show detailed information for all the configurations of the dataset.
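
For orientation, a single example in the original `oscar` card consists of an integer identifier and the document text; assuming this extract keeps the same schema, an instance looks roughly like the sketch below (the text is a hypothetical placeholder).

```python
# Illustrative shape of a single example, assuming the schema of the original
# "oscar" dataset: an integer "id" and the raw document "text".
example = {
    "id": 0,
    "text": "Plain text of one crawled document, possibly spanning several lines ...",  # placeholder
}

# With a loaded split `ds`, the same structure can be checked via `ds.features` and `ds[0]`.
```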

Dataset Creation

Curation Rationale

OSCAR was constructed using a new pipeline derived from fastText's, called goclassy. Goclassy reuses the fastText linear classifier and the pre-trained fastText model for language recognition, but it completely rewrites and parallelises the pipeline in an asynchronous manner.

The order of operations is more or less the same as in the fastText preprocessing pipeline, but instead of clustering multiple operations into a single blocking process, a worker is launched for each operation, and the number of operations that can run in parallel at a given time is bounded by the number of available threads rather than the number of CPUs. Goclassy is implemented in the Go programming language, so it lets the Go runtime handle the scheduling of the processes. Thus goclassy's pipeline does not have to wait for a whole WET file to be downloaded, decompressed and classified before starting on the next one; a new file starts downloading and being processed as soon as the scheduler is able to allocate a new process.
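
Goclassy itself is written in Go and is not reproduced here; purely to illustrate the scheduling idea (one worker per file, parallelism bounded by a fixed number of threads rather than CPUs), here is a rough Python sketch with hypothetical stand-ins for the download, decompress and classify stages.

```python
# Illustration of the goclassy scheduling idea, not the actual Go implementation.
# Each WET file goes through download -> decompress -> classify on its own worker,
# and total parallelism is bounded by a fixed number of threads rather than CPUs.
# The three stage functions are hypothetical stand-ins for the real pipeline steps.
import gzip
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_THREADS = 8  # bound on the number of simultaneous operations

def download(url: str) -> bytes:
    return gzip.compress(b"placeholder WET content\n")  # stand-in for the HTTP fetch

def decompress(raw: bytes) -> list:
    return gzip.decompress(raw).decode("utf-8").splitlines()

def classify(lines: list) -> dict:
    return {"en": lines}  # stand-in for the fastText language classifier

def process_wet_file(url: str) -> dict:
    return classify(decompress(download(url)))

def run_pipeline(wet_urls):
    # A new file starts as soon as the pool can allocate a worker; nothing waits
    # for a previous file to finish downloading, decompressing and classifying.
    with ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
        futures = [pool.submit(process_wet_file, url) for url in wet_urls]
        for future in as_completed(futures):
            yield future.result()

if __name__ == "__main__":
    for result in run_pipeline([f"wet-file-{i}" for i in range(3)]):
        print(result)
```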

Before each line is fed to the classifier, it is filtered and cleaned: lines shorter than 100 UTF-8 characters and lines containing invalid UTF-8 characters are discarded. After all files are processed, the deduplicated versions are constructed, and everything is then split into shards and compressed.
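
As a concrete reading of that filtering step, the following is a small sketch (not the original goclassy code) of the line-level check described above.

```python
# Sketch of the line-level filtering described above (not the original goclassy
# code): lines containing invalid UTF-8 and lines shorter than 100 UTF-8
# characters are dropped before being handed to the language classifier.
MIN_CHARS = 100

def keep_line(raw: bytes) -> bool:
    try:
        text = raw.decode("utf-8")      # lines with invalid UTF-8 are discarded
    except UnicodeDecodeError:
        return False
    return len(text) >= MIN_CHARS       # lines shorter than 100 characters are discarded

raw_lines = [b"too short", b"\xff\xfeinvalid bytes", b"x" * 150]
kept = [line for line in raw_lines if keep_line(line)]
print(len(kept))  # -> 1
```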

Source Data

Initial Data Collection and Normalization

Common Crawl is a non-profit foundation which produces and maintains an open repository of web-crawled data that is both accessible and analysable. Common Crawl's complete web archive consists of petabytes of data collected over 8 years of web crawling. The repository contains raw web page HTML data (WARC files), metadata extracts (WAT files) and plain text extracts (WET files). The organisation's crawlers have always respected nofollow and robots.txt policies.

Each monthly Common Crawl snapshot is in itself a massive multilingual corpus, where every single file contains data coming from multiple web pages written in a large variety of languages and covering all kinds of topics.

To construct OSCAR, the WET files of Common Crawl were used. These contain the plain text extracted from the websites, mostly converted to UTF-8, as well as headers containing the metadata of each crawled document. Each WET file is compressed in gzip format and stored on Amazon Web Services. In the case of OSCAR, the November 2018 snapshot was used. It surpasses 20TB of uncompressed data and contains more than 50 thousand plain text files, where each file consists of the plain text from multiple websites along with their metadata headers.
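
If you want to inspect the underlying WET records yourself, they can be read with the third-party `warcio` library (an assumption here, not part of the OSCAR tooling); the sketch below expects a local copy of one gzipped WET file, and the file path is a hypothetical placeholder.

```python
# Sketch for inspecting a Common Crawl WET file with the third-party `warcio`
# library (pip install warcio). The path below is a hypothetical placeholder for
# a locally downloaded, gzipped WET file from a Common Crawl snapshot.
from warcio.archiveiterator import ArchiveIterator

wet_path = "example.warc.wet.gz"  # hypothetical local file

with open(wet_path, "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":  # WET plain-text records are "conversion" records
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="replace")
        print(url, len(text))
```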

Who are the source language producers?

The data comes from web pages in a large variety of languages.

Annotations

The dataset does not contain any additional annotations.

Annotation process

N/A

Who are the annotators?

N/A

Personal and Sensitive Information

Being constructed from Common Crawl, OSCAR may contain personal and sensitive information. This must be considered before training deep learning models with OSCAR, especially in the case of text-generation models.

Considerations for Using the Data

Social Impact of Dataset

OSCAR is intended to bring more data to a wide variety of languages; the aim of the corpus is to provide large quantities of data for lower-resourced languages in order to facilitate the pre-training of state-of-the-art language modeling architectures.

Discussion of Biases

OSCAR has not been properly filtered, and this could be reflected in models trained on it. Particular attention should be paid to biases in the resulting models.

Other Known Limitations

The fastText linear classifier is limited both in performance and in the variety of languages it can recognise, so the quality of some OSCAR sub-corpora might be lower than expected, especially for the lowest-resource languages. Some audits have already been conducted by third parties.

Additional Information

Dataset Curators

The corpus was put together by Pedro J. Ortiz, Benoît Sagot and Laurent Romary during work done at Inria, in particular within the ALMAnaCH team.

Licensing Information

These data are released under this licensing scheme
We do not own any of the text from which these data has been extracted.
We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/
To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR
This work is published from: France.

Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
* Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
* Clearly identify the copyrighted work claimed to be infringed.
* Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

Citation Information

@inproceedings{ortiz-suarez-etal-2020-monolingual,
    title = "A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages",
    author = "Ortiz Su{'a}rez, Pedro Javier  and
      Romary, Laurent  and
      Sagot, Benoit",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.156",
    pages = "1703--1714",
    abstract = "We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitutes the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.",
}

@inproceedings{OrtizSuarezSagotRomary2019,
  author    = {Pedro Javier {Ortiz Su{\'a}rez} and Benoit Sagot and Laurent Romary},
  title     = {Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures},
  series = {Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019},
  editor    = {Piotr Bański and Adrien Barbaresi and Hanno Biber and Evelyn Breiteneder and Simon Clematide and Marc Kupietz and Harald L{\"u}ngen and Caroline Iliadi},
  publisher = {Leibniz-Institut f{\"u}r Deutsche Sprache},
  address   = {Mannheim},
  doi       = {10.14618/ids-pub-9021},
  url       = {http://nbn-resolving.de/urn:nbn:de:bsz:mh39-90215},
  pages     = {9 -- 16},
  year      = {2019},
  abstract  = {Common Crawl is a considerably large, heterogeneous multilingual corpus comprised of crawled documents from the internet, surpassing 20TB of data and distributed as a set of more than 50 thousand plain text files where each contains many documents written in a wide variety of languages. Even though each document has a metadata block associated to it, this data lacks any information about the language in which each document is written, making it extremely difficult to use Common Crawl for monolingual applications. We propose a general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language; we specifically design it so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint. We develop the pipeline so that it can be easily reapplied to any kind of heterogeneous corpus and so that it can be parameterised to a wide range of infrastructures. We also distribute a 6.3TB version of Common Crawl, filtered, classified by language, shuffled at line level in order to avoid copyright issues, and ready to be used for NLP applications.},
  language  = {en}
}

Contributions

Thanks to @pjox and @lhoestq for adding this dataset.