Dataset:
bertin-project/mc4-es-sampled
This dataset is the result of applying perplexity sampling to the Spanish portion of mC4, using the mc4-sampling method. Please refer to the BERTIN Project.
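For intuition only, here is a minimal, hypothetical sketch of the perplexity-sampling idea: each document is kept with a probability given by a weight function of its perplexity, so that documents that are neither too clean nor too noisy are favored. The `perplexity_fn`, the Gaussian weight function, and its `center`/`width` values below are placeholders, not the exact functions or parameters used by BERTIN; those are defined in the mc4-sampling script.

```python
import math
import random


def gaussian_weight(perplexity, center=2000.0, width=1000.0):
    # Placeholder weight: documents whose perplexity is close to `center`
    # are kept more often; very clean or very noisy documents are downsampled.
    return math.exp(-((perplexity - center) ** 2) / (2 * width ** 2))


def sample_documents(docs, perplexity_fn, weight_fn=gaussian_weight):
    # Keep each document with probability given by its perplexity weight.
    for doc in docs:
        if random.random() < weight_fn(perplexity_fn(doc["text"])):
            yield doc
```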
You can load the sampled Spanish mC4 configurations as follows:
```python
from datasets import load_dataset

for config in ("random", "stepwise", "gaussian"):
    mc4es = load_dataset(
        "bertin-project/mc4-es-sampled",
        config,
        split="train",
        streaming=True
    ).shuffle(buffer_size=1000)
    for sample in mc4es:
        print(config, sample)
        break
```
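For example, a short sketch (assuming the streaming API shown above) that materializes the first 1,000 shuffled examples of the `gaussian` configuration into a list without downloading the full split:

```python
from itertools import islice

from datasets import load_dataset

mc4es = load_dataset(
    "bertin-project/mc4-es-sampled",
    "gaussian",
    split="train",
    streaming=True,
).shuffle(buffer_size=1000)

# Collect a small, fixed number of examples from the stream.
subset = list(islice(mc4es, 1000))
print(len(subset), subset[0]["url"])
```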
Alternatively, you can bypass the datasets library and download a specific configuration, in the same order used to pre-train the BERTIN models, as JSON Lines files of roughly 200 GB (about 1.5 hours depending on your connection speed):
```python
import io
import gzip
import json
import sys

import requests
from tqdm import tqdm

_DATA_URL_TRAIN = "https://huggingface.co/datasets/bertin-project/mc4-es-sampled/resolve/main/mc4-es-train-50M-{config}-shard-{index:04d}-of-{n_shards:04d}.json.gz"


def main(config="stepwise"):
    data_urls = [
        _DATA_URL_TRAIN.format(
            config=config,
            index=index + 1,
            n_shards=1024,
        )
        for index in range(1024)
    ]
    with open(f"mc4-es-train-50M-{config}.jsonl", "w") as f:
        for data_url in tqdm(data_urls):
            response = requests.get(data_url)
            bio = io.BytesIO(response.content)
            with gzip.open(bio, "rt", encoding="utf8") as g:
                for line in g:
                    json_line = json.loads(line.strip())
                    f.write(json.dumps(json_line) + "\n")


if __name__ == "__main__":
    main(sys.argv[1])
```
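If you save the script above as, say, download_mc4es.py (the filename is just an illustration), you can run it with the desired configuration and then stream the resulting JSON Lines file back in; a minimal sketch:

```python
# Run the downloader first, e.g.:
#   python download_mc4es.py stepwise
import json

with open("mc4-es-train-50M-stepwise.jsonl", encoding="utf8") as f:
    for i, line in enumerate(f):
        doc = json.loads(line)
        print(doc["url"])
        if i >= 4:  # only peek at the first few documents
            break
```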
mC4-es-sampled is mainly intended for reproducibility of the BERTIN Project and for pre-training language models and word representations on a medium budget.
The dataset is available in Spanish only.
An example from the Gaussian configuration:
```
{'timestamp': '2018-10-20T06:20:53Z', 'text': 'Ortho HyaluroTop 200 aporta el colágeno y ácido hialurónico que, con la edad, se producen en menor cantidad. La vitamina C promueve la producción de colágeno para mantener la piel sana y protege a las células contra los radicales libres causados por la contaminación ambiental y los rayos UV.', 'url': 'https://www.farmaciagaleno.com/orthonat-hyalurotop-200-30-capsulas'}
```
The data has several fields:
- `text`: the text content of the document
- `timestamp`: the crawl timestamp
- `url`: the URL the text was extracted from
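For instance, a sketch (using the streaming loader shown earlier) that keeps only the `text` field, e.g. to feed a tokenizer during pre-training:

```python
from datasets import load_dataset

mc4es = load_dataset(
    "bertin-project/mc4-es-sampled",
    "gaussian",
    split="train",
    streaming=True,
)

# Yield only the raw text of each document.
texts = (sample["text"] for sample in mc4es)
print(next(texts)[:200])
```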
The sizes of the resulting Spanish mC4 subsets are reported in the table below:
| config   | train |
|----------|-------|
| stepwise | 50M   |
| random   | 50M   |
| gaussian | 50M   |
The validation split is exactly the same as in the original mC4 dataset.
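A sketch, assuming the configurations expose a `validation` split as stated above:

```python
from datasets import load_dataset

# The validation data mirrors the original mC4 Spanish validation split.
mc4es_val = load_dataset(
    "bertin-project/mc4-es-sampled",
    "gaussian",
    split="validation",
    streaming=True,
)
print(next(iter(mc4es_val))["url"])
```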
The Spanish portion of this dataset was built by applying perplexity sampling to the original mC4 via the mc4-sampling method.
The original data was provided by Common Crawl.
AllenAI releases this dataset under the terms of ODC-BY. By using the dataset, you are also bound by the Common Crawl terms of use in respect of the content contained in it.
To cite this dataset (arXiv):
```bibtex
@article{BERTIN,
  author   = {Javier De la Rosa y Eduardo G. Ponferrada y Manu Romero y Paulo Villegas y Pablo González de Prado Salas y María Grandury},
  title    = {{BERTIN}: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling},
  journal  = {Procesamiento del Lenguaje Natural},
  volume   = {68},
  number   = {0},
  year     = {2022},
  keywords = {},
  abstract = {The pre-training of large language models usually requires massive amounts of resources, both in terms of computation and data. Frequently used web sources such as Common Crawl might contain enough noise to make this pretraining sub-optimal. In this work, we experiment with different sampling methods from the Spanish version of mC4, and present a novel data-centric technique which we name perplexity sampling that enables the pre-training of language models in roughly half the amount of steps and using one fifth of the data. The resulting models are comparable to the current state-of-the-art, and even achieve better results for certain tasks. Our work is proof of the versatility of Transformers, and paves the way for small teams to train their models on a limited budget.},
  issn     = {1989-7553},
  url      = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403},
  pages    = {13--23}
}
```
If you use this dataset, we would love to hear about it! Reach out to us on Twitter, GitHub, Discord, or send us an email.
To cite the original mC4 dataset:
```bibtex
@article{2019t5,
  author        = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
  title         = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
  journal       = {arXiv e-prints},
  year          = {2019},
  archivePrefix = {arXiv},
  eprint        = {1910.10683},
}
```
Dataset contributed by @versae for the BERTIN Project.