Dataset:

bertin-project/mc4-es-sampled


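Dataset Card for mC4-es-sampled

Dataset Summary

This dataset is the result of applying perplexity sampling to the Spanish portion of mC4, using the mc4-sampling method. See the BERTIN Project for details.

You can load the sampled Spanish mC4 as follows: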

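from datasets import load_dataset

# Stream one example from each sampling configuration without downloading the full dataset.
for config in ("random", "stepwise", "gaussian"):
    mc4es = load_dataset(
        "bertin-project/mc4-es-sampled",
        config,
        split="train",
        streaming=True,
    ).shuffle(buffer_size=1000)  # approximate shuffling over a 1000-example buffer
    for sample in mc4es:
        print(config, sample)
        break  # only the first example is needed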

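Alternatively, you can bypass the datasets library and download a specific config (about 1.5 hours, depending on connection speed) in the same order used to pre-train the BERTIN models, as a single JSON Lines file of roughly 200 GB: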

import io
import gzip
import json
import sys

import requests
from tqdm import tqdm

_DATA_URL_TRAIN = "https://huggingface.co/datasets/bertin-project/mc4-es-sampled/resolve/main/mc4-es-train-50M-{config}-shard-{index:04d}-of-{n_shards:04d}.json.gz"


def main(config="stepwise"):
    data_urls = [
        _DATA_URL_TRAIN.format(
            config=config,
            index=index + 1,
            n_shards=1024,
        )
        for index in range(1024)
    ]
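    # Write all shards, in order, into a single JSON Lines file.
    with open(f"mc4-es-train-50M-{config}.jsonl", "w", encoding="utf8") as f:
        for data_url in tqdm(data_urls):
            response = requests.get(data_url)
            response.raise_for_status()  # stop on a failed download instead of writing bad data
            bio = io.BytesIO(response.content)
            with gzip.open(bio, "rt", encoding="utf8") as g:
                for line in g:
                    json_line = json.loads(line.strip())
                    f.write(json.dumps(json_line) + "\n")


if __name__ == "__main__":
    # Use the config given on the command line, or fall back to main()'s default.
    main(*sys.argv[1:2])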
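
The shard naming and the decompression step of the script above can be checked without touching the network. The following is a minimal sketch: the URL template is copied from the script, while the helper names shard_urls and iter_jsonl_gz and the synthetic record are assumptions made for illustration only.

```python
import gzip
import io
import json

# URL template copied from the download script above.
_DATA_URL_TRAIN = (
    "https://huggingface.co/datasets/bertin-project/mc4-es-sampled/resolve/main/"
    "mc4-es-train-50M-{config}-shard-{index:04d}-of-{n_shards:04d}.json.gz"
)


def shard_urls(config, n_shards=1024):
    """Return the shard URLs for one config (hypothetical helper); indices are 1-based."""
    return [
        _DATA_URL_TRAIN.format(config=config, index=i, n_shards=n_shards)
        for i in range(1, n_shards + 1)
    ]


def iter_jsonl_gz(raw_bytes):
    """Yield one dict per non-empty line of gzip-compressed JSON Lines bytes."""
    with gzip.open(io.BytesIO(raw_bytes), "rt", encoding="utf8") as lines:
        for line in lines:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Offline check with a synthetic shard (a made-up record, not real data).
fake_shard = gzip.compress(
    b'{"text": "hola", "url": "https://example.com", "timestamp": "2020-01-01T00:00:00Z"}\n'
)
records = list(iter_jsonl_gz(fake_shard))
urls = shard_urls("gaussian")
```

Passing response.content from requests to iter_jsonl_gz would reproduce the inner loop of the script, so the parsing logic can be tested separately from the download.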

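Supported Tasks and Leaderboards

mC4-es-sampled is mainly intended for reproducing the BERTIN Project and for pre-training language models and word representations on a medium budget.

Languages

The dataset only supports the Spanish language.

Dataset Structure

Data Instances

A sample from the Gaussian config: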

{'timestamp': '2018-10-20T06:20:53Z', 'text': 'Ortho HyaluroTop 200 aporta el colágeno y ácido hialurónico que, con la edad, se producen en menor cantidad. La vitamina C promueve la producción de colágeno para mantener la piel sana y protege a las células contra los radicales libres causados por la contaminación ambiental y los rayos UV.', 'url': 'https://www.farmaciagaleno.com/orthonat-hyalurotop-200-30-capsulas'}

Data Fields

Each example has the following fields:

  • url : URL of the source, as a string
  • text : text content, as a string
  • timestamp : timestamp, as a string
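
Since all three fields arrive as plain strings, downstream code usually has to parse them. The sketch below assumes that every timestamp follows the ISO 8601 form seen in the sample instance (with a trailing Z meaning UTC), which is inferred from that single example rather than stated by the card. The helper name parse_timestamp is hypothetical, and the text value is shortened here for brevity.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

# Record shaped like the sample instance; the text is shortened for brevity.
sample = {
    "timestamp": "2018-10-20T06:20:53Z",
    "text": "Ortho HyaluroTop 200 aporta el colágeno y ácido hialurónico",
    "url": "https://www.farmaciagaleno.com/orthonat-hyalurotop-200-30-capsulas",
}


def parse_timestamp(value):
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' string into a timezone-aware UTC datetime."""
    # Assumption: the trailing 'Z' denotes UTC, as in ISO 8601.
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


crawled_at = parse_timestamp(sample["timestamp"])
domain = urlparse(sample["url"]).netloc  # host part of the source URL
```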

Data Splits

The resulting mC4 subsets for Spanish are reported in this table:

config train
stepwise 50M
random 50M
gaussian 50M

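The validation split is exactly the same as in the original mc4 dataset.

Dataset Creation

Curation Rationale

The Spanish portion of this dataset was built by applying perplexity sampling, via mc4-sampling, to the original mc4.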
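
To give a rough intuition for perplexity-based subsampling, here is a toy sketch. It is not the mc4-sampling implementation: this card does not describe the weighting functions behind the stepwise and gaussian configs, so the Gaussian-shaped keep-probability, its center and width parameters, and the perplexity field on each document are all hypothetical.

```python
import math
import random


def gaussian_keep_probability(perplexity, center, width, factor=1.0):
    """Hypothetical weighting: keep-probability peaks at `center` and decays with distance."""
    z = (perplexity - center) / width
    return min(1.0, factor * math.exp(-0.5 * z * z))


def subsample(docs, center, width, seed=0):
    """Keep each document with a probability given by its (precomputed) perplexity."""
    rng = random.Random(seed)  # seeded for reproducibility
    return [
        doc
        for doc in docs
        if rng.random() < gaussian_keep_probability(doc["perplexity"], center, width)
    ]


# Toy corpus with made-up perplexity scores from 0 to 1990.
docs = [{"text": f"doc {i}", "perplexity": float(p)} for i, p in enumerate(range(0, 2000, 10))]
kept = subsample(docs, center=1000.0, width=200.0)
```

In this toy setup, documents whose perplexity is far from the chosen center are rarely kept, which is the general idea of favoring text that a language model finds neither trivially predictable nor very noisy.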

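Additional Information

Dataset Curators

Original data by Common Crawl.

Licensing Information

AllenAI are releasing this dataset under the terms of ODC-BY. By using it, you are also bound by the Common Crawl terms of use with respect to the content contained in the dataset.

Citation Information

To cite this dataset (arXiv):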

@article{BERTIN,
    author = {Javier De la Rosa and Eduardo G. Ponferrada and Manu Romero and Paulo Villegas and Pablo González de Prado Salas and María Grandury},
    title = {{BERTIN}: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling},
    journal = {Procesamiento del Lenguaje Natural},
    volume = {68},
    number = {0},
    year = {2022},
    keywords = {},
    abstract = {The pre-training of large language models usually requires massive amounts of resources, both in terms of computation and data. Frequently used web sources such as Common Crawl might contain enough noise to make this pretraining sub-optimal. In this work, we experiment with different sampling methods from the Spanish version of mC4, and present a novel data-centric technique which we name perplexity sampling that enables the pre-training of language models in roughly half the amount of steps and using one fifth of the data. The resulting models are comparable to the current state-of-the-art, and even achieve better results for certain tasks. Our work is proof of the versatility of Transformers, and paves the way for small teams to train their models on a limited budget.},
    issn = {1989-7553},
    url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403},
    pages = {13--23}
}

If you use this dataset, we would love to hear about it! Reach out on Twitter, GitHub, or Discord, or send us an email.

To cite the original mc4 dataset:

@article{2019t5,
    author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
    title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
    journal = {arXiv e-prints},
    year = {2019},
    archivePrefix = {arXiv},
    eprint = {1910.10683},
}

Contributions

Dataset contributed by @versae for the BERTIN Project.

Thanks to @dirkgr and @lhoestq for adding the original mC4 dataset.