Dataset: bertin-project/mc4-sampling
Sub-tasks: language-modeling
Multilinguality: multilingual
Language creators: found
Annotation creators: no-annotation
Source datasets: original
Preprint: arxiv:1910.10683
License: odc-by

This dataset builds upon the AllenAI version of the original mC4 dataset and adds sampling methods that perform perplexity-based filtering on the fly. Please refer to the BERTIN Project.
The original dataset is mC4, a multilingual, colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset: https://commoncrawl.org.
It provides 108 languages, which are reported in detail in the mc4 dataset.
You can load the mC4 subset of any language like this:
    from datasets import load_dataset

    en_mc4 = load_dataset("mc4", "en")
You can even specify a list of languages:
    from datasets import load_dataset

    mc4_subset_with_five_languages = load_dataset("mc4", languages=["en", "fr", "es", "de", "zh"])
There are three main ways to obtain sampled versions of mC4 with this dataset.

Random
This is probably the simplest method. It keeps a document based on a probability threshold we call "factor". The default probability for random sampling is 0.5:
    def _should_keep_doc_random(self, doc, factor=None, **kwargs):
        factor = 0.5 if factor is None else factor
        return self.rng.uniform() <= factor
To use this sampling method, add the extra parameters when instantiating the dataset:
    from datasets import load_dataset

    mc4random = load_dataset(
        "bertin-project/mc4-sampling", "es",
        split="train",
        streaming=True,
        sampling_method="random",
        factor=0.5,
    )
    for sample in mc4random:
        print(sample)
        break
Gaussian
This sampling method tries to adjust to the perplexity distribution of the documents in a given language. It oversamples the central quartiles of the perplexity distribution. Two parameters control the shape of the approximation: "factor" (the peakedness of the exponential function) and "width" (the spread of the distribution). The default values were chosen for Spanish.
    def _should_keep_doc_gaussian(self, doc, factor=None, width=None, boundaries=None, **kwargs):
        perplexity = self.get_perplexity(doc)
        width = (9 / 2) if width is None else width
        factor = 0.78 if factor is None else factor
        median = 662247.50212365 if boundaries is None else boundaries[1]
        exponential = np.exp((-1 / width) * ((perplexity - median) / median) ** 2)
        weighted_perplexity = factor * exponential
        return self.rng.uniform() < weighted_perplexity
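To get a feel for how "factor" and "width" shape the weighting, the following standalone sketch (not part of the loader) evaluates the keep probability at a few perplexity values around the default Spanish median:

    import numpy as np

    # Keep probability assigned by the Gaussian weighting above, using the
    # default Spanish median (662247.50212365), factor=0.78 and width=9/2.
    median, factor, width = 662247.50212365, 0.78, 9 / 2
    for perplexity in (median, 2 * median, 5 * median):
        weight = factor * np.exp((-1 / width) * ((perplexity - median) / median) ** 2)
        print(f"perplexity={perplexity:,.0f} -> keep probability ~ {weight:.3f}")

Documents with perplexity near the median are kept with probability close to "factor", while documents far into the tails are almost always discarded.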
To use this sampling method, the quartile boundaries of the underlying distribution need to be computed beforehand and passed in when instantiating the dataset. In addition, a path to a KenLM model (a 5-gram language model), or an object with a method .score(text: str) -> float, must be passed in to compute the perplexity values of the documents. KenLM can be installed with pip:
    pip install https://github.com/kpu/kenlm/archive/master.zip
    from datasets import load_dataset

    mc4gaussian = load_dataset(
        "bertin-project/mc4-sampling", "es",
        split="train",
        streaming=True,
        sampling_method="gaussian",
        perplexity_model="./es.arpa.bin",
        boundaries=[536394.99320948, 662247.50212365, 919250.87225178],
        factor=0.78,
        width=9/2,
    )
    for sample in mc4gaussian:
        print(sample)
        break
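As noted above, perplexity_model may also be an object exposing .score(text: str) -> float instead of a file path. The following is a minimal sketch of such an object, assuming the loader accepts it as described; KenlmScorer is a hypothetical name and simply delegates to a KenLM model:

    import kenlm  # pip install https://github.com/kpu/kenlm/archive/master.zip
    from datasets import load_dataset

    class KenlmScorer:
        """Hypothetical wrapper exposing .score(text) -> float over a KenLM 5-gram model."""
        def __init__(self, model_path):
            self.model = kenlm.Model(model_path)

        def score(self, text: str) -> float:
            # kenlm returns the total log10 probability of the text under the model
            return self.model.score(text)

    mc4gaussian = load_dataset(
        "bertin-project/mc4-sampling", "es",
        split="train",
        streaming=True,
        sampling_method="gaussian",
        perplexity_model=KenlmScorer("./es.arpa.bin"),
        boundaries=[536394.99320948, 662247.50212365, 919250.87225178],
    )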
Facebook has created and released 5-gram Kneser-Ney models for 100 languages that can be downloaded and used with the KenLM library. To download your own Kneser-Ney language model, choose a language code from the list below:
af,ar,az,be,bg,bn,ca,cs,da,de,el,en,es,et,fa,fi,fr,gu,he,hi,hr,hu,hy,id,is,it,ja,ka,kk,km,kn,ko,lt,lv,mk,ml,mn,mr,my,ne,nl,no,pl,pt,ro,ru,uk,zh
and run the following download command, replacing "lang" with your own language code:
    wget http://dl.fbaipublicfiles.com/cc_net/lm/lang.arpa.bin
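If you prefer to stay in Python, the same download can be scripted; this convenience sketch simply substitutes the language code into the URL above:

    import urllib.request

    lang = "es"  # pick a code from the list above
    url = f"http://dl.fbaipublicfiles.com/cc_net/lm/{lang}.arpa.bin"
    urllib.request.urlretrieve(url, f"./{lang}.arpa.bin")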
Stepwise
The stepwise sampling method uses a simpler criterion: it oversamples the central quartiles in inverse proportion to their range. Only three parameters are needed: "boundaries", "factor" (the strength of the oversampling), and "perplexity_model":
    def _should_keep_doc_step(self, doc, factor=None, boundaries=None, **kwargs):
        perplexity = self.get_perplexity(doc)
        factor = 1.5e5 if factor is None else factor
        if boundaries is None:
            boundaries = [536394.99320948, 662247.50212365, 919250.87225178]
        if perplexity <= boundaries[0]:
            quartile_range = boundaries[0]
        elif boundaries[0] < perplexity < boundaries[1]:
            quartile_range = boundaries[1] - boundaries[0]
        elif boundaries[1] < perplexity < boundaries[2]:
            quartile_range = boundaries[2] - boundaries[1]
        elif perplexity >= boundaries[2]:
            quartile_range = 10 * boundaries[2]
        probability = factor / quartile_range
        return self.rng.uniform() < probability
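To see what "inverse proportion to their range" means in practice, this standalone sketch prints the keep probability implied for each quartile by the default Spanish boundaries and factor=1.5e5:

    # Per-quartile keep probabilities implied by the defaults in the function above,
    # capped at 1.0 since a ratio above 1 means the document is always kept.
    boundaries = [536394.99320948, 662247.50212365, 919250.87225178]
    factor = 1.5e5
    quartile_ranges = [
        boundaries[0],                  # 1st quartile
        boundaries[1] - boundaries[0],  # 2nd quartile
        boundaries[2] - boundaries[1],  # 3rd quartile
        10 * boundaries[2],             # tail above the 3rd boundary
    ]
    for i, quartile_range in enumerate(quartile_ranges, start=1):
        print(f"quartile {i}: keep probability ~ {min(1.0, factor / quartile_range):.3f}")

The narrow central quartiles get the highest keep probabilities, while the high-perplexity tail is sampled very sparsely.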
To use this sampling method, a similar call is needed:
    from datasets import load_dataset

    mc4stepwise = load_dataset(
        "bertin-project/mc4-sampling", "es",
        split="train",
        streaming=True,
        sampling_method="stepwise",
        perplexity_model="./es.arpa.bin",
        boundaries=[536394.99320948, 662247.50212365, 919250.87225178],
        factor=1.5e5,
    )
    for sample in mc4stepwise:
        print(sample)
        break
mC4-sampling is mainly intended for pretraining language models and word representations on a budget.
The dataset supports 108 languages.
A sample from the "en" configuration looks as follows:
    {'timestamp': '2018-06-24T01:32:39Z', 'text': 'Farm Resources in Plumas County\nShow Beginning Farmer Organizations & Professionals (304)\nThere are 304 resources serving Plumas County in the following categories:\nMap of Beginning Farmer Organizations & Professionals serving Plumas County\nVictoria Fisher - Office Manager - Loyalton, CA\nAmy Lynn Rasband - UCCE Plumas-Sierra Administrative Assistant II - Quincy , CA\nShow Farm Income Opportunities Organizations & Professionals (353)\nThere are 353 resources serving Plumas County in the following categories:\nFarm Ranch And Forest Retailers (18)\nMap of Farm Income Opportunities Organizations & Professionals serving Plumas County\nWarner Valley Wildlife Area - Plumas County\nShow Farm Resources Organizations & Professionals (297)\nThere are 297 resources serving Plumas County in the following categories:\nMap of Farm Resources Organizations & Professionals serving Plumas County\nThere are 57 resources serving Plumas County in the following categories:\nMap of Organic Certification Organizations & Professionals serving Plumas County', 'url': 'http://www.californialandcan.org/Plumas/Farm-Resources/'}
The data have several fields: url (the source URL), text (the document text), and timestamp (the crawl date), as shown in the sample above.
The same splits as in mC4 are available.
The BERTIN Project releases this dataset under the same terms AllenAI used to release mC4, namely those of the ODC-BY license. By using this dataset, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.
To cite this dataset:
    @article{BERTIN,
        author = {Javier De la Rosa y Eduardo G. Ponferrada y Manu Romero y Paulo Villegas y Pablo González de Prado Salas y María Grandury},
        title = {{BERTIN}: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling},
        journal = {Procesamiento del Lenguaje Natural},
        volume = {68},
        number = {0},
        year = {2022},
        keywords = {},
        abstract = {The pre-training of large language models usually requires massive amounts of resources, both in terms of computation and data. Frequently used web sources such as Common Crawl might contain enough noise to make this pretraining sub-optimal. In this work, we experiment with different sampling methods from the Spanish version of mC4, and present a novel data-centric technique which we name perplexity sampling that enables the pre-training of language models in roughly half the amount of steps and using one fifth of the data. The resulting models are comparable to the current state-of-the-art, and even achieve better results for certain tasks. Our work is proof of the versatility of Transformers, and paves the way for small teams to train their models on a limited budget.},
        issn = {1989-7553},
        url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403},
        pages = {13--23}
    }
If you use this dataset, we would love to hear about it! Reach out on Twitter, GitHub, Discord, or send us an email.
To cite the original mc4 dataset:
    @article{2019t5,
        author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
        title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
        journal = {arXiv e-prints},
        year = {2019},
        archivePrefix = {arXiv},
        eprint = {1910.10683},
    }
Dataset contributed by @versae.