Dataset:
bigscience/xP3mt
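xP3 (Crosslingual Public Pool of Prompts) is a collection of prompts and datasets spanning 46 languages and 16 NLP tasks. It is used to train BLOOMZ and mT0, multilingual language models capable of following human instructions in many languages zero-shot.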
Name | Explanation | Example models |
---|---|---|
1234321 | Mixture of 17 tasks in 277 languages with English prompts | WIP - Join us at Project Aya @ 1235321 to help! |
1236321 | Mixture of 13 training tasks in 46 languages with English prompts | 1237321 & 1238321 |
1239321 | Mixture of 13 training tasks in 46 languages with prompts in 20 languages (machine-translated from English) | 12310321 & 12311321 |
12312321 | xP3 + evaluation datasets adding an additional 3 tasks for a total of 16 tasks in 46 languages with English prompts | |
12313321 | 12314321 processed version of xP3 | 1237321 |
12316321 | Repreprocessed version of the English-only 12317321 with 8 training tasks | 12318321 & 12319321 |
An example of "train" looks as follows:
{ "inputs": "Oración 1: Fue académico en literatura metafísica, teología y ciencias clásicas.\nOración 2: Fue académico en literatura metafísica, teología y ciencia clásica.\nPregunta: ¿La oración 1 parafrasea la oración 2? ¿Si o no?", "targets": "Sí" }
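The data fields are the same across all splits: `inputs` (the natural-language input) and `targets` (the expected output), as in the example above.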
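A record like the "train" example can be read with the Python standard library alone. The following is a minimal sketch, assuming the data is stored as JSON Lines (one JSON object per line, as the `.jsonl` extension suggests) with exactly the `inputs` and `targets` keys shown in the example. The helper name `read_records` and the shortened sample text are hypothetical, introduced only for illustration.

```python
import io
import json


def read_records(lines):
    """Yield (inputs, targets) pairs from JSON Lines text, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: every record carries both keys, as in the "train" example.
        yield record["inputs"], record["targets"]


# Hypothetical in-memory stand-in for a merged_{lang}.jsonl file.
sample = io.StringIO(
    '{"inputs": "Pregunta: \\u00bfSi o no?", "targets": "S\\u00ed"}\n'
    "\n"
)
pairs = list(read_records(sample))
```

To read a real file, pass an open file handle (for example `open(path, encoding="utf-8")`) instead of the `io.StringIO` object.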
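The table below summarizes the size of each language (computed from the merged_{lang}.jsonl files). Because languages like tw consist only of single-sentence translation samples from Flores, their byte percentage is significantly lower than their sample percentage. Prompts were machine-translated for monolingual datasets, so languages that have only crosslingual datasets (e.g. translation) have no non-English prompts. Languages without non-English prompts are equivalent to xP3.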
Language | Kilobytes | % | Samples | % | Non-English prompts |
---|---|---|---|---|---|
tw | 106288 | 0.11 | 265071 | 0.33 | |
bm | 107056 | 0.11 | 265180 | 0.33 | |
ak | 108096 | 0.11 | 265071 | 0.33 | |
ca | 110608 | 0.11 | 271191 | 0.34 | |
eu | 113008 | 0.12 | 281199 | 0.35 | |
fon | 113072 | 0.12 | 265063 | 0.33 | |
st | 114080 | 0.12 | 265063 | 0.33 | |
ki | 115040 | 0.12 | 265180 | 0.33 | |
tum | 116032 | 0.12 | 265063 | 0.33 | |
wo | 122560 | 0.13 | 365063 | 0.46 | |
ln | 126304 | 0.13 | 365060 | 0.46 | |
as | 156256 | 0.16 | 265063 | 0.33 | |
or | 161472 | 0.17 | 265063 | 0.33 | |
kn | 165456 | 0.17 | 265063 | 0.33 | |
ml | 175040 | 0.18 | 265864 | 0.33 | |
rn | 192992 | 0.2 | 318189 | 0.4 | |
nso | 229712 | 0.24 | 915051 | 1.14 | |
tn | 235536 | 0.24 | 915054 | 1.14 | |
lg | 235936 | 0.24 | 915021 | 1.14 | |
rw | 249360 | 0.26 | 915043 | 1.14 | |
ts | 250256 | 0.26 | 915044 | 1.14 | |
sn | 252496 | 0.26 | 865056 | 1.08 | |
xh | 254672 | 0.26 | 915058 | 1.14 | |
zu | 263712 | 0.27 | 915061 | 1.14 | |
ny | 272128 | 0.28 | 915063 | 1.14 | |
ig | 325440 | 0.33 | 950097 | 1.19 | ✅ |
yo | 339664 | 0.35 | 913021 | 1.14 | ✅ |
ne | 398144 | 0.41 | 315754 | 0.39 | ✅ |
pa | 529632 | 0.55 | 339210 | 0.42 | ✅ |
sw | 561392 | 0.58 | 1114439 | 1.39 | ✅ |
gu | 566576 | 0.58 | 347499 | 0.43 | ✅ |
mr | 674000 | 0.69 | 417269 | 0.52 | ✅ |
bn | 854864 | 0.88 | 428725 | 0.54 | ✅ |
ta | 943440 | 0.97 | 410633 | 0.51 | ✅ |
te | 1384016 | 1.42 | 573354 | 0.72 | ✅ |
ur | 1944416 | 2.0 | 855756 | 1.07 | ✅ |
vi | 3113184 | 3.2 | 1667306 | 2.08 | ✅ |
code | 4330752 | 4.46 | 2707724 | 3.38 | |
hi | 4469712 | 4.6 | 1543441 | 1.93 | ✅ |
id | 4538768 | 4.67 | 2582272 | 3.22 | ✅ |
zh | 4604112 | 4.74 | 3571636 | 4.46 | ✅ |
ar | 4703968 | 4.84 | 2148970 | 2.68 | ✅ |
fr | 5558912 | 5.72 | 5055942 | 6.31 | ✅ |
pt | 6130016 | 6.31 | 3562772 | 4.45 | ✅ |
es | 7579424 | 7.8 | 5151349 | 6.43 | ✅ |
en | 39252528 | 40.4 | 32740750 | 40.87 | |
total | 97150128 | 100.0 | 80100816 | 100.0 | ✅ |
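The gap between byte share and sample share can be checked directly from the table. The sketch below copies the `tw`, `en`, and `total` rows and recomputes both percentages, along with the average size per sample. The function names are hypothetical, and the 2-decimal rounding is an assumption about how the table was produced.

```python
# (kilobytes, samples) copied from the "tw" and "en" rows of the table.
ROWS = {
    "tw": (106288, 265071),
    "en": (39252528, 32740750),
}
# Values from the "total" row.
TOTAL_KB = 97150128
TOTAL_SAMPLES = 80100816


def shares(kilobytes, samples):
    """Return (byte %, sample %) of the whole dataset, rounded to 2 decimals."""
    byte_pct = round(100 * kilobytes / TOTAL_KB, 2)
    sample_pct = round(100 * samples / TOTAL_SAMPLES, 2)
    return byte_pct, sample_pct


def avg_kb_per_sample(kilobytes, samples):
    """Average size of one sample in kilobytes."""
    return kilobytes / samples


tw_shares = shares(*ROWS["tw"])
en_shares = shares(*ROWS["en"])
```

Short single-sentence samples give `tw` a much smaller average size per sample than `en`, which is why its byte percentage trails its sample percentage.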
The dataset is released under Apache 2.0.
@misc{muennighoff2022crosslingual,
  title={Crosslingual Generalization through Multitask Finetuning},
  author={Niklas Muennighoff and Thomas Wang and Lintang Sutawika and Adam Roberts and Stella Biderman and Teven Le Scao and M Saiful Bari and Sheng Shen and Zheng-Xin Yong and Hailey Schoelkopf and Xiangru Tang and Dragomir Radev and Alham Fikri Aji and Khalid Almubarak and Samuel Albanie and Zaid Alyafeai and Albert Webson and Edward Raff and Colin Raffel},
  year={2022},
  eprint={2211.01786},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Thanks to the contributors of promptsource for adding many of the prompts used in this dataset.