Dataset:

bigscience/xP3all

Dataset Card for xP3

Dataset Summary

xP3 (Crosslingual Public Pool of Prompts) is a collection of prompts and datasets across 46 languages and 16 NLP tasks. It is used to train BLOOMZ and mT0, multilingual language models capable of following human instructions zero-shot in dozens of languages.

  • Creation: The dataset can be recreated using instructions available here. We provide this version to save processing time and ease reproducibility.
  • Languages: 46 (can be extended by recreating with more splits)
  • xP3 Dataset Family:
| Name | Explanation | Example models |
| xP3x | Mixture of 17 tasks in 277 languages with English prompts | WIP - Join us at Project Aya @ C4AI to help! |
| xP3 | Mixture of 13 training tasks in 46 languages with English prompts | bloomz & mt0-xxl |
| xP3mt | Mixture of 13 training tasks in 46 languages with prompts in 20 languages (machine-translated from English) | bloomz-mt & mt0-xxl-mt |
| xP3all | xP3 + evaluation datasets adding an additional 3 tasks for a total of 16 tasks in 46 languages with English prompts | |
| xP3megds | Megatron-DeepSpeed processed version of xP3 | bloomz |
| P3 | Re-preprocessed version of the English-only P3 with 8 training tasks | bloomz-p3 & mt0-xxl-p3 |

Dataset Structure

Data Instances

An example of "train" looks as follows:

{
"inputs": "Sentence 1: Fue académico en literatura metafísica, teología y ciencias clásicas.\nSentence 2: Fue académico en literatura metafísica, teología y ciencia clásica.\nQuestion: Can we rewrite Sentence 1 to Sentence 2? Yes or No?",
"targets": "Yes" 
}

Data Fields

The data fields are the same among all splits:

  • inputs : the natural language input fed to the model
  • targets : the natural language target that the model has to generate
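As a minimal sketch of how a record with these two fields might be read, the snippet below parses one JSON line and checks that both fields are present as strings. The helper name `parse_record` and the shortened sample text are illustrative assumptions, not part of the dataset tooling.

```python
import json

# The two fields every xP3 record is described as having.
REQUIRED_FIELDS = ("inputs", "targets")


def parse_record(line: str) -> dict:
    """Parse one JSONL line and verify it has string `inputs` and `targets`.

    Hypothetical helper for illustration only.
    """
    record = json.loads(line)
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"missing or non-string field: {field!r}")
    return record


# Sample line modeled on the "train" example (sentences shortened for brevity).
sample_line = json.dumps({
    "inputs": (
        "Sentence 1: Fue académico en ciencias clásicas.\n"
        "Sentence 2: Fue académico en ciencia clásica.\n"
        "Question: Can we rewrite Sentence 1 to Sentence 2? Yes or No?"
    ),
    "targets": "Yes",
})

record = parse_record(sample_line)
print(record["inputs"])
print(record["targets"])
```

The same check could be applied line by line to a local JSONL file, assuming the file stores one record per line.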

Data Splits

The table below summarizes sizes per language (computed from the merged_{lang}.jsonl files). Because languages such as tw consist only of single-sentence translation samples from Flores, their byte percentage is significantly lower than their sample percentage.

| Language | Kilobytes | % of kilobytes | Samples | % of samples |
| tw | 106288 | 0.11 | 265071 | 0.33 |
| bm | 107056 | 0.11 | 265180 | 0.33 |
| ak | 108096 | 0.11 | 265071 | 0.33 |
| ca | 110608 | 0.11 | 271191 | 0.33 |
| eu | 113008 | 0.11 | 281199 | 0.35 |
| fon | 113072 | 0.11 | 265063 | 0.33 |
| st | 114080 | 0.11 | 265063 | 0.33 |
| ki | 115040 | 0.12 | 265180 | 0.33 |
| tum | 116032 | 0.12 | 265063 | 0.33 |
| wo | 122560 | 0.12 | 365063 | 0.45 |
| ln | 126304 | 0.13 | 365060 | 0.45 |
| as | 156256 | 0.16 | 265063 | 0.33 |
| or | 161472 | 0.16 | 265063 | 0.33 |
| kn | 165456 | 0.17 | 265063 | 0.33 |
| ml | 175040 | 0.18 | 265864 | 0.33 |
| rn | 192992 | 0.19 | 318189 | 0.39 |
| nso | 229712 | 0.23 | 915051 | 1.13 |
| tn | 235536 | 0.24 | 915054 | 1.13 |
| lg | 235936 | 0.24 | 915021 | 1.13 |
| rw | 249360 | 0.25 | 915043 | 1.13 |
| ts | 250256 | 0.25 | 915044 | 1.13 |
| sn | 252496 | 0.25 | 865056 | 1.07 |
| xh | 254672 | 0.26 | 915058 | 1.13 |
| zu | 263712 | 0.26 | 915061 | 1.13 |
| ny | 272128 | 0.27 | 915063 | 1.13 |
| ig | 325232 | 0.33 | 950097 | 1.17 |
| yo | 352784 | 0.35 | 918416 | 1.13 |
| ne | 393680 | 0.39 | 315754 | 0.39 |
| pa | 523248 | 0.52 | 339210 | 0.42 |
| gu | 560688 | 0.56 | 347499 | 0.43 |
| sw | 566656 | 0.57 | 1130481 | 1.4 |
| mr | 666240 | 0.67 | 417269 | 0.52 |
| bn | 832720 | 0.83 | 428843 | 0.53 |
| ta | 926912 | 0.93 | 415433 | 0.51 |
| te | 1343232 | 1.35 | 584590 | 0.72 |
| ur | 1918272 | 1.92 | 855756 | 1.06 |
| vi | 3102512 | 3.11 | 1672106 | 2.07 |
| code | 4330752 | 4.34 | 2707724 | 3.34 |
| hi | 4403568 | 4.41 | 1554667 | 1.92 |
| zh | 4599440 | 4.61 | 3589234 | 4.43 |
| id | 4612256 | 4.62 | 2643418 | 3.27 |
| ar | 4683456 | 4.69 | 2160181 | 2.67 |
| fr | 6591120 | 6.6 | 5316403 | 6.57 |
| pt | 6886800 | 6.9 | 3752156 | 4.63 |
| es | 8587920 | 8.6 | 5413205 | 6.69 |
| en | 39252528 | 39.33 | 32740750 | 40.44 |
| total | 99807184 | 100.0 | 80956089 | 100.0 |
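The relationship between the byte and sample percentages can be checked with a little arithmetic. The sketch below recomputes both shares for two rows copied from the table and adds an average size per sample; the helper names `share` and `kb_per_sample` are illustrative, and the average is a derived figure not reported in the table itself.

```python
# (kilobytes, samples) for two rows copied from the table.
ROWS = {
    "tw": (106288, 265071),
    "en": (39252528, 32740750),
}
# Values from the "total" row.
TOTAL_KB = 99807184
TOTAL_SAMPLES = 80956089


def share(part: int, total: int) -> float:
    """Percentage of `total` made up by `part`, rounded to two decimals."""
    return round(100 * part / total, 2)


def kb_per_sample(kilobytes: int, samples: int) -> float:
    """Average kilobytes per sample (derived, not in the table)."""
    return kilobytes / samples


for lang, (kb, n) in ROWS.items():
    print(
        lang,
        share(kb, TOTAL_KB),          # byte percentage
        share(n, TOTAL_SAMPLES),      # sample percentage
        round(kb_per_sample(kb, n), 2),
    )
```

Under these numbers, tw averages roughly 0.4 KB per sample against roughly 1.2 KB for en, which is consistent with its byte share being well below its sample share.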

Dataset Creation

Source Data

| Training datasets | Evaluation datasets (included in xP3all except for HumanEval) | Additional xP3all datasets |

Additional Information

Licensing Information

The dataset is released under Apache 2.0.

Citation Information

@misc{muennighoff2022crosslingual,
      title={Crosslingual Generalization through Multitask Finetuning}, 
      author={Niklas Muennighoff and Thomas Wang and Lintang Sutawika and Adam Roberts and Stella Biderman and Teven Le Scao and M Saiful Bari and Sheng Shen and Zheng-Xin Yong and Hailey Schoelkopf and Xiangru Tang and Dragomir Radev and Alham Fikri Aji and Khalid Almubarak and Samuel Albanie and Zaid Alyafeai and Albert Webson and Edward Raff and Colin Raffel},
      year={2022},
      eprint={2211.01786},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

Thanks to the contributors of promptsource for adding many prompts used in this dataset.