Dataset:
bigscience/xP3
xP3 (Crosslingual Public Pool of Prompts) is a collection of prompts and datasets spanning 46 languages and 16 NLP tasks. It is used to train BLOOMZ and mT0, multilingual language models that can follow human instructions in many languages zero-shot. A minimal loading sketch follows the variant table below.
Name | Explanation | Example models |
---|---|---|
xP3x | Mixture of 17 tasks in 277 languages with English prompts | WIP - Join us at Project Aya @ C4AI to help!
xP3 | Mixture of 13 training tasks in 46 languages with English prompts | bloomz & mt0-xxl
xP3mt | Mixture of 13 training tasks in 46 languages with prompts in 20 languages (machine-translated from English) | bloomz-mt & mt0-xxl-mt
xP3all | xP3 + evaluation datasets adding an additional 3 tasks for a total of 16 tasks in 46 languages with English prompts | |
xP3megds | Megatron-DeepSpeed processed version of xP3 | bloomz
P3 | Repreprocessed version of the English-only P3 with 8 training tasks | bloomz-p3 & mt0-xxl-p3
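As a quick orientation, here is a minimal sketch of loading one of these variants with the Hugging Face datasets library. The per-language config name "en" is an assumption about how the Hub repository is organized, not something stated in the table above.

```python
# Minimal sketch, assuming xP3 exposes per-language subsets on the Hub;
# the config name "en" is an assumption, not taken from the table above.
from datasets import load_dataset

ds = load_dataset("bigscience/xP3", "en", split="train")

# Each record carries an "inputs" prompt and a "targets" completion.
print(ds[0]["inputs"])
print(ds[0]["targets"])
```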
"train" 的示例如下:
{ "inputs": "Sentence 1: Fue académico en literatura metafísica, teología y ciencias clásicas.\nSentence 2: Fue académico en literatura metafísica, teología y ciencia clásica.\nQuestion: Can we rewrite Sentence 1 to Sentence 2? Yes or No?", "targets": "Yes" }
The data fields are the same among all splits:
- inputs: the natural language input fed to the model
- targets: the natural language target that the model has to generate
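To make the two fields concrete, the sketch below shows one way a record could be tokenized for seq2seq training with an mT5-style tokenizer (mT0 is mT5-based). The checkpoint google/mt5-small and the length limits are illustrative choices, not part of this dataset card.

```python
# Minimal sketch: tokenize "inputs" for the encoder and "targets" as labels
# for a seq2seq model. google/mt5-small and the max lengths are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")

example = {
    "inputs": "Question: Can we rewrite Sentence 1 to Sentence 2? Yes or No?",
    "targets": "Yes",
}

model_inputs = tokenizer(example["inputs"], truncation=True, max_length=1024)
labels = tokenizer(text_target=example["targets"], truncation=True, max_length=16)
model_inputs["labels"] = labels["input_ids"]
print(model_inputs["labels"])
```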
The table below summarizes the size of each language (computed from the merged_{lang}.jsonl files); a sketch of how these numbers could be recomputed follows the table. Because languages such as tw appear only as single-sentence translation samples from Flores, their byte percentage is much lower than their sample percentage. Adding a new language is very simple; you can take this script adding Russian as an example.
Language | Kilobytes | % of total KB | Samples | % of total samples |
---|---|---|---|---|
tw | 106288 | 0.11 | 265071 | 0.34 |
bm | 107056 | 0.11 | 265180 | 0.34 |
ak | 108096 | 0.11 | 265071 | 0.34 |
eu | 108112 | 0.11 | 269973 | 0.34 |
ca | 110608 | 0.12 | 271191 | 0.34 |
fon | 113072 | 0.12 | 265063 | 0.34 |
st | 114080 | 0.12 | 265063 | 0.34 |
ki | 115040 | 0.12 | 265180 | 0.34 |
tum | 116032 | 0.12 | 265063 | 0.34 |
wo | 122560 | 0.13 | 365063 | 0.46 |
ln | 126304 | 0.13 | 365060 | 0.46 |
as | 156256 | 0.16 | 265063 | 0.34 |
or | 161472 | 0.17 | 265063 | 0.34 |
kn | 165456 | 0.17 | 265063 | 0.34 |
ml | 175040 | 0.18 | 265864 | 0.34 |
rn | 192992 | 0.2 | 318189 | 0.4 |
nso | 229712 | 0.24 | 915051 | 1.16 |
tn | 235536 | 0.25 | 915054 | 1.16 |
lg | 235936 | 0.25 | 915021 | 1.16 |
rw | 249360 | 0.26 | 915043 | 1.16 |
ts | 250256 | 0.26 | 915044 | 1.16 |
sn | 252496 | 0.27 | 865056 | 1.1 |
xh | 254672 | 0.27 | 915058 | 1.16 |
zu | 263712 | 0.28 | 915061 | 1.16 |
ny | 272128 | 0.29 | 915063 | 1.16 |
ig | 325232 | 0.34 | 950097 | 1.2 |
yo | 352784 | 0.37 | 918416 | 1.16 |
ne | 393680 | 0.41 | 315754 | 0.4 |
pa | 523248 | 0.55 | 339210 | 0.43 |
gu | 560688 | 0.59 | 347499 | 0.44 |
sw | 560896 | 0.59 | 1114455 | 1.41 |
mr | 666240 | 0.7 | 417269 | 0.53 |
bn | 832720 | 0.88 | 428843 | 0.54 |
ta | 924496 | 0.97 | 410633 | 0.52 |
te | 1332912 | 1.4 | 573364 | 0.73 |
ur | 1918272 | 2.02 | 855756 | 1.08 |
vi | 3101408 | 3.27 | 1667306 | 2.11 |
code | 4330752 | 4.56 | 2707724 | 3.43 |
hi | 4393696 | 4.63 | 1543441 | 1.96 |
zh | 4589904 | 4.83 | 3560556 | 4.51 |
id | 4606288 | 4.85 | 2627392 | 3.33 |
ar | 4677264 | 4.93 | 2148955 | 2.72 |
fr | 5546688 | 5.84 | 5055942 | 6.41 |
pt | 6129584 | 6.46 | 3562772 | 4.52 |
es | 7571808 | 7.98 | 5151349 | 6.53 |
en | 37261104 | 39.25 | 31495184 | 39.93 |
total | 94941936 | 100.0 | 78883588 | 100.0 |
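As referenced above the table, here is a hedged sketch of how the per-language kilobyte and sample counts could be recomputed from local merged_{lang}.jsonl files. The directory path and glob pattern are placeholders for wherever the files were downloaded.

```python
# Minimal sketch: recompute per-language kilobytes and sample counts from
# local merged_{lang}.jsonl files (one JSON object per line).
# "xP3/" is a placeholder; adjust the glob to the actual local layout.
import glob
import os

stats = {}
for path in glob.glob("xP3/**/merged_*.jsonl", recursive=True):
    lang = os.path.basename(path)[len("merged_"):-len(".jsonl")]
    with open(path, encoding="utf-8") as f:
        n_samples = sum(1 for _ in f)
    stats[lang] = (os.path.getsize(path) // 1024, n_samples)  # (KiB, samples)

total_kb = sum(kb for kb, _ in stats.values()) or 1
total_samples = sum(n for _, n in stats.values()) or 1
for lang, (kb, n) in sorted(stats.items(), key=lambda kv: kv[1][0]):
    print(f"{lang} | {kb} | {kb / total_kb:.2%} | {n} | {n / total_samples:.2%}")
```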
The dataset is released under the Apache 2.0 license.
@article{muennighoff2022crosslingual,
  title={Crosslingual generalization through multitask finetuning},
  author={Muennighoff, Niklas and Wang, Thomas and Sutawika, Lintang and Roberts, Adam and Biderman, Stella and Scao, Teven Le and Bari, M Saiful and Shen, Sheng and Yong, Zheng-Xin and Schoelkopf, Hailey and others},
  journal={arXiv preprint arXiv:2211.01786},
  year={2022}
}
Thanks to the contributors of promptsource for adding many of the prompts used in this dataset.