Dataset:
Muennighoff/xP3x
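xP3x (Crosslingual Public Pool of Prompts eXtended) is a collection of prompts and datasets covering 277 languages and 16 NLP tasks. It contains everything in xP3, plus much more. It is used to train future contenders to mT0 and BLOOMZ at Project Aya @ C4AI.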
Name | Explanation | Example models |
---|---|---|
1236321 | Mixture of 17 tasks in 277 languages with English prompts | WIP - Join us at Project Aya @ 1237321 to help! |
1238321 | Mixture of 13 training tasks in 46 languages with English prompts | 1239321 & 12310321 |
12311321 | Mixture of 13 training tasks in 46 languages with prompts in 20 languages (machine-translated from English) | 12312321 & 12313321 |
12314321 | xP3 + evaluation datasets adding an additional 3 tasks for a total of 16 tasks in 46 languages with English prompts | |
12315321 | 12316321 processed version of xP3 | 1239321 |
12318321 | Repreprocessed version of the English-only 12319321 with 8 training tasks | 12320321 & 12321321 |
An example looks as follows:
    {
        'inputs': '11月、遂にクロームはファイヤーフォックスを引き離し始めた。_はインターネットユーザーの評価が高まったのだ。\nReplace the _ in the above sentence with the correct option: \n- ファイヤーフォックス\n- クローム',
        'targets': 'クローム',
        'language': 'jpn_Jpan',
        'split': 'test',
        'template': 'Replace',
        'dataset': 'Muennighoff/xwinograd',
        'config': 'jp'
    }
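The data fields are the same among all splits: `inputs`, `targets`, `language`, `split`, `template`, `dataset`, and `config`.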
The dataset is 680 GB in size and contains 530 million samples. Filter and deduplicate it according to your needs.
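The card does not prescribe a deduplication method. As one possible approach, the minimal sketch below drops exact duplicates by hashing the `inputs` and `targets` fields. The helper name `dedup_samples` and the toy records are hypothetical, and keeping one digest per unique sample in memory is an assumption that may not hold at the scale of the full dataset.

```python
import hashlib
from typing import Dict, Iterable, Iterator, Set


def dedup_samples(samples: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield each sample once, keyed on its (inputs, targets) pair."""
    seen: Set[bytes] = set()
    for sample in samples:
        # Hash the pair so the set stores fixed-size digests, not full texts.
        # The NUL separator keeps ("ab", "c") distinct from ("a", "bc").
        key = hashlib.sha256(
            (sample["inputs"] + "\x00" + sample["targets"]).encode("utf-8")
        ).digest()
        if key not in seen:
            seen.add(key)
            yield sample


# Toy records (hypothetical), using only the `inputs` and `targets` fields.
toy = [
    {"inputs": "2 + 2 =", "targets": "4"},
    {"inputs": "2 + 2 =", "targets": "4"},  # exact duplicate, dropped
    {"inputs": "2 + 2 =", "targets": "four"},
]
unique = list(dedup_samples(toy))
print(len(unique))  # 2
```

Because the helper is a generator, it can wrap any iterable of records without materializing the whole collection first.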
Loading by language:
    # pip install -q datasets
    from datasets import load_dataset
    ds = load_dataset("Muennighoff/xP3x", "zho_Hans", streaming=True)  # Use streaming to not download all at once
    for x in ds["train"]:
        print(x)
        break
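You can then filter on the data fields, for example to keep only a specific config or dataset. Since each dataset-config-template combination is stored as a separate jsonl file, you can also decide which datasets, configs, and templates you want and download only those files. For example, to download all Japanese xwinograd samples, you could run: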
    # pip install -q datasets
    from datasets import load_dataset
    import multiprocessing
    # pip install --upgrade huggingface-hub
    from huggingface_hub import HfFileSystem, hf_hub_url

    fs = HfFileSystem()
    fps = fs.glob(f"datasets/Muennighoff/xP3x/data/jpn_Jpan/*xwinograd*")
    resolved_paths = [fs.resolve_path(file) for file in fps]
    data_files = [hf_hub_url(resolved_path.repo_id, resolved_path.path_in_repo, repo_type=resolved_path.repo_type) for resolved_path in resolved_paths]

    ds = load_dataset("json", data_files=data_files, num_proc=8)["train"]
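Filtering on data fields after loading, as opposed to selecting files up front, could look like the following sketch. The helper `select_samples` and the toy records are hypothetical. It works on any iterable of dictionaries, which is assumed here to include a streamed split such as the one from the by-language example.

```python
from typing import Dict, Iterable, Iterator, Optional


def select_samples(
    samples: Iterable[Dict[str, str]],
    dataset: Optional[str] = None,
    config: Optional[str] = None,
    template: Optional[str] = None,
) -> Iterator[Dict[str, str]]:
    """Yield samples whose fields match every criterion that is not None."""
    wanted = {"dataset": dataset, "config": config, "template": template}
    for sample in samples:
        if all(v is None or sample.get(k) == v for k, v in wanted.items()):
            yield sample


# Toy records (hypothetical); only the metadata fields matter for this sketch.
toy = [
    {"dataset": "Muennighoff/xwinograd", "config": "jp", "template": "Replace"},
    {"dataset": "Muennighoff/xwinograd", "config": "en", "template": "Replace"},
    {"dataset": "some/other-dataset", "config": "jp", "template": "Replace"},
]
japanese_xwinograd = list(
    select_samples(toy, dataset="Muennighoff/xwinograd", config="jp")
)
print(len(japanese_xwinograd))  # 1
```

Selecting files before download avoids transferring unwanted data, so field-level filtering like this is mainly useful for criteria that do not map onto file names.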
Language | Code | Kilobytes | % of kilobytes | Samples | % of samples |
---|---|---|---|---|---|
Emilian | egl_Latn | 104 | 0.0 | 402 | 0.0 |
Swiss German | gsw_Latn | 104 | 0.0 | 408 | 0.0 |
Novial | nov_Latn | 116 | 0.0 | 432 | 0.0 |
Ainu (Latin script) | ain_Latn | 120 | 0.0 | 410 | 0.0 |
Chamorro | cha_Latn | 120 | 0.0 | 452 | 0.0 |
Gothic | got_Goth | 120 | 0.0 | 402 | 0.0 |
Prussian | prg_Latn | 120 | 0.0 | 424 | 0.0 |
Picard | pcd_Latn | 140 | 0.0 | 530 | 0.0 |
Northern Frisian | frr_Latn | 156 | 0.0 | 554 | 0.0 |
Uzbek (Latin script) | uzb_Latn | 156 | 0.0 | 600 | 0.0 |
Ottoman Turkish (Latin script) | ota_Latn | 188 | 0.0 | 632 | 0.0 |
Swahili (macrolanguage) | swa_Latn | 212 | 0.0 | 772 | 0.0 |
Talossan | tzl_Latn | 220 | 0.0 | 836 | 0.0 |
Kven Finnish | fkv_Latn | 260 | 0.0 | 910 | 0.0 |
Zaza | zza_Latn | 260 | 0.0 | 1,056 | 0.0 |
Frisian | fry_Latn | 268 | 0.0 | 956 | 0.0 |
Piemontese | pms_Latn | 276 | 0.0 | 998 | 0.0 |
Kalmyk | xal_Cyrl | 288 | 0.0 | 976 | 0.0 |
Hunsrik | hrx_Latn | 352 | 0.0 | 1,380 | 0.0 |
Romany | rom_Latn | 364 | 0.0 | 1,410 | 0.0 |
Ancient Greek (to 1453) | grc_Grek | 392 | 0.0 | 1,226 | 0.0 |
Tase Naga | nst_Latn | 424 | 0.0 | 1,608 | 0.0 |
Albanian | sqi_Latn | 596 | 0.0 | 2,216 | 0.0 |
Guadeloupean Creole French | gcf_Latn | 608 | 0.0 | 2,326 | 0.0 |
Yakut | sah_Cyrl | 608 | 0.0 | 1,986 | 0.0 |
Ho (Latin script) | hoc_Latn | 632 | 0.0 | 2,634 | 0.0 |
Khasi | kha_Latn | 676 | 0.0 | 2,664 | 0.0 |
Algerian Arabic | arq_Arab | 688 | 0.0 | 2,278 | 0.0 |
Lower Sorbian | dsb_Latn | 692 | 0.0 | 2,596 | 0.0 |
Chuvash | chv_Cyrl | 716 | 0.0 | 2,446 | 0.0 |
Old Russian | orv_Cyrl | 752 | 0.0 | 2,586 | 0.0 |
Pampanga | pam_Latn | 784 | 0.0 | 2,984 | 0.0 |
Kurdish (Latin script) | kur_Latn | 796 | 0.0 | 3,050 | 0.0 |
Ottoman Turkish | ota_Arab | 832 | 0.0 | 2,772 | 0.0 |
Kotava | avk_Latn | 864 | 0.0 | 3,118 | 0.0 |
Upper Sorbian | hsb_Latn | 900 | 0.0 | 3,474 | 0.0 |
Buryat | bua_Cyrl | 924 | 0.0 | 3,218 | 0.0 |
Swabian | swg_Latn | 996 | 0.0 | 3,366 | 0.0 |
Coastal Kadazan | kzj_Latn | 1,136 | 0.0 | 3,766 | 0.0 |
Chavacano | cbk_Latn | 1,352 | 0.0 | 4,994 | 0.0 |
Quechua | que_Latn | 1,704 | 0.0 | 5,312 | 0.0 |
Lingua Franca Nova (Cyrillic script) | lfn_Cyrl | 1,740 | 0.0 | 5,458 | 0.0 |
Gronings | gos_Latn | 1,864 | 0.0 | 7,462 | 0.0 |
Volapük | vol_Latn | 1,948 | 0.0 | 7,712 | 0.0 |
Yue Chinese (Simplified) | yue_Hans | 2,300 | 0.0 | 7,872 | 0.0 |
Mari (Russia) | chm_Cyrl | 2,540 | 0.0 | 7,496 | 0.0 |
Kadazan Dusun | dtp_Latn | 2,548 | 0.0 | 8,892 | 0.0 |
Breton | bre_Latn | 3,048 | 0.0 | 11,868 | 0.0 |
Ladino | lad_Latn | 3,224 | 0.0 | 11,916 | 0.0 |
Cornish | cor_Latn | 3,492 | 0.0 | 13,880 | 0.0 |
Interlingue | ile_Latn | 3,700 | 0.0 | 14,468 | 0.0 |
Wu Chinese | wuu_Hans | 3,784 | 0.0 | 13,062 | 0.0 |
Japanese (Katakana) | jpn_Kana | 4,208 | 0.0 | 13,942 | 0.0 |
Ido | ido_Latn | 6,180 | 0.0 | 23,742 | 0.0 |
Yiddish | yid_Hebr | 9,896 | 0.0 | 34,412 | 0.01 |
Klingon | tlh_Latn | 11,716 | 0.0 | 46,010 | 0.01 |
Lingua Franca Nova | lfn_Latn | 13,328 | 0.0 | 46,826 | 0.01 |
Lojban | jbo_Latn | 17,468 | 0.0 | 66,694 | 0.01 |
Low German | nds_Latn | 18,364 | 0.0 | 68,098 | 0.01 |
Interlingua (International Auxiliary Language Association) | ina_Latn | 25,700 | 0.0 | 76,584 | 0.01 |
Java | java | 25,904 | 0.0 | 13,551 | 0.0 |
Japanese (Kanji) | jpn_Hani | 26,292 | 0.0 | 89,978 | 0.02 |
Norwegian | nor_Latn | 26,724 | 0.0 | 93,116 | 0.02 |
Toki Pona | toki_Latn | 26,808 | 0.0 | 97,170 | 0.02 |
Latin | lat_Latn | 28,900 | 0.0 | 101,390 | 0.02 |
Serbo-Croatian | hbs_Latn | 29,452 | 0.0 | 105,748 | 0.02 |
Nigerian Pidgin | pcm_Latn | 145,872 | 0.02 | 88,992 | 0.02 |
Azerbaijani (South or North; Latin script) | aze_Latn | 147,564 | 0.02 | 77,875 | 0.01 |
Serbian (Latin script) | srp_Latn | 179,072 | 0.03 | 131,101 | 0.02 |
Japanese (Hiragana) | jpn_Hira | 188,944 | 0.03 | 628,758 | 0.12 |
Berber (Latin script) | ber_Latn | 201,464 | 0.03 | 693,602 | 0.13 |
Jupyter Notebook | jupyter_notebook | 416,056 | 0.06 | 400,000 | 0.08 |
Yue Chinese | yue_Hant | 613,352 | 0.09 | 1,227,429 | 0.23 |
Haitian Creole | hat_Latn | 629,420 | 0.09 | 1,228,281 | 0.23 |
Mossi | mos_Latn | 630,416 | 0.09 | 1,223,481 | 0.23 |
Pangasinan | pag_Latn | 630,684 | 0.09 | 1,223,481 | 0.23 |
Twi | twi_Latn | 631,172 | 0.09 | 1,223,481 | 0.23 |
Bosnian | bos_Latn | 633,016 | 0.09 | 1,224,479 | 0.23 |
Ewe | ewe_Latn | 633,292 | 0.09 | 1,223,481 | 0.23 |
Bambara | bam_Latn | 634,520 | 0.09 | 1,223,481 | 0.23 |
Javanese | jav_Latn | 635,248 | 0.09 | 1,224,003 | 0.23 |
Southwestern Dinka | dik_Latn | 635,416 | 0.09 | 1,223,481 | 0.23 |
Kabuverdianu | kea_Latn | 636,144 | 0.09 | 1,223,481 | 0.23 |
Dyula | dyu_Latn | 636,464 | 0.09 | 1,223,481 | 0.23 |
Venetian | vec_Latn | 637,412 | 0.09 | 1,223,481 | 0.23 |
Chokwe | cjk_Latn | 637,532 | 0.09 | 1,223,481 | 0.23 |
Latgalian | ltg_Latn | 637,612 | 0.09 | 1,223,481 | 0.23 |
Sundanese | sun_Latn | 638,120 | 0.09 | 1,223,481 | 0.23 |
Asturian | ast_Latn | 638,708 | 0.09 | 1,223,481 | 0.23 |
Akan | aka_Latn | 639,648 | 0.09 | 1,223,481 | 0.23 |
Mizo | lus_Latn | 639,680 | 0.09 | 1,223,481 | 0.23 |
Guarani | grn_Latn | 641,540 | 0.09 | 1,225,647 | 0.23 |
Limburgish | lim_Latn | 642,368 | 0.09 | 1,223,481 | 0.23 |
Faroese | fao_Latn | 642,432 | 0.09 | 1,224,067 | 0.23 |
Buginese | bug_Latn | 643,472 | 0.09 | 1,223,481 | 0.23 |
Sango | sag_Latn | 643,596 | 0.09 | 1,223,481 | 0.23 |
Luba-Kasai | lua_Latn | 643,640 | 0.09 | 1,223,481 | 0.23 |
Papiamento | pap_Latn | 643,648 | 0.09 | 1,223,481 | 0.23 |
Silesian | szl_Latn | 644,608 | 0.09 | 1,223,481 | 0.23 |
Sicilian | scn_Latn | 645,636 | 0.1 | 1,223,481 | 0.23 |
Kimbundu | kmb_Latn | 645,964 | 0.1 | 1,223,481 | 0.23 |
Basque | eus_Latn | 646,084 | 0.1 | 1,246,877 | 0.23 |
Balinese | ban_Latn | 646,408 | 0.1 | 1,223,481 | 0.23 |
Norwegian Nynorsk | nno_Latn | 646,996 | 0.1 | 1,229,699 | 0.23 |
Central Aymara | ayr_Latn | 647,236 | 0.1 | 1,223,481 | 0.23 |
Tamasheq (Latin script) | taq_Latn | 648,656 | 0.1 | 1,223,481 | 0.23 |
Kikongo | kon_Latn | 648,992 | 0.1 | 1,223,481 | 0.23 |
Friulian | fur_Latn | 649,272 | 0.1 | 1,223,481 | 0.23 |
Ayacucho Quechua | quy_Latn | 649,992 | 0.1 | 1,223,481 | 0.23 |
Maori | mri_Latn | 650,336 | 0.1 | 1,224,211 | 0.23 |
Icelandic | isl_Latn | 650,372 | 0.1 | 1,246,623 | 0.23 |
Galician | glg_Latn | 652,088 | 0.1 | 1,233,291 | 0.23 |
Catalan | cat_Latn | 652,116 | 0.1 | 1,241,381 | 0.23 |
Lombard | lmo_Latn | 652,120 | 0.1 | 1,223,481 | 0.23 |
Banjar (Latin script) | bjn_Latn | 652,372 | 0.1 | 1,223,481 | 0.23 |
Fijian | fij_Latn | 652,796 | 0.1 | 1,223,481 | 0.23 |
Crimean Tatar | crh_Latn | 653,920 | 0.1 | 1,223,895 | 0.23 |
Northern Kurdish | kmr_Latn | 654,108 | 0.1 | 1,223,481 | 0.23 |
Ligurian | lij_Latn | 654,432 | 0.1 | 1,223,481 | 0.23 |
Occitan | oci_Latn | 655,676 | 0.1 | 1,227,945 | 0.23 |
Turkmen | tuk_Latn | 658,672 | 0.1 | 1,241,205 | 0.23 |
Luxembourgish | ltz_Latn | 658,768 | 0.1 | 1,225,339 | 0.23 |
Cebuano | ceb_Latn | 659,124 | 0.1 | 1,226,039 | 0.23 |
Samoan | smo_Latn | 659,704 | 0.1 | 1,223,481 | 0.23 |
Sardinian | srd_Latn | 660,000 | 0.1 | 1,223,481 | 0.23 |
Bemba | bem_Latn | 660,504 | 0.1 | 1,223,481 | 0.23 |
Minangkabau (Latin script) | min_Latn | 660,672 | 0.1 | 1,223,481 | 0.23 |
Acehnese (Latin script) | ace_Latn | 661,084 | 0.1 | 1,223,481 | 0.23 |
Ilocano | ilo_Latn | 661,184 | 0.1 | 1,227,663 | 0.23 |
Irish | gle_Latn | 661,660 | 0.1 | 1,227,357 | 0.23 |
Fon | fon_Latn | 663,124 | 0.1 | 1,223,481 | 0.23 |
Waray | war_Latn | 664,120 | 0.1 | 1,226,503 | 0.23 |
Norwegian Bokmål | nob_Latn | 666,240 | 0.1 | 1,300,607 | 0.24 |
Tosk Albanian | als_Latn | 666,692 | 0.1 | 1,223,481 | 0.23 |
Standard Malay | zsm_Latn | 667,088 | 0.1 | 1,270,715 | 0.24 |
Southern Sotho | sot_Latn | 667,728 | 0.1 | 1,223,481 | 0.23 |
Kabyle | kab_Latn | 668,128 | 0.1 | 1,346,605 | 0.25 |
Jingpho | kac_Latn | 669,464 | 0.1 | 1,223,481 | 0.23 |
Lingala | lin_Latn | 670,428 | 0.1 | 1,323,481 | 0.25 |
Wolof | wol_Latn | 670,568 | 0.1 | 1,373,481 | 0.26 |
Central Kanuri (Latin script) | knc_Latn | 670,800 | 0.1 | 1,223,481 | 0.23 |
Kikuyu | kik_Latn | 672,096 | 0.1 | 1,223,481 | 0.23 |
Tok Pisin | tpi_Latn | 672,916 | 0.1 | 1,223,481 | 0.23 |
Nuer | nus_Latn | 673,632 | 0.1 | 1,223,481 | 0.23 |
Tagalog | tgl_Latn | 673,684 | 0.1 | 1,247,417 | 0.23 |
Tumbuka | tum_Latn | 676,948 | 0.1 | 1,223,481 | 0.23 |
Plateau Malagasy | plt_Latn | 677,852 | 0.1 | 1,223,481 | 0.23 |
Afrikaans | afr_Latn | 679,164 | 0.1 | 1,337,091 | 0.25 |
North Azerbaijani | azj_Latn | 679,820 | 0.1 | 1,223,481 | 0.23 |
Kabiyè | kbp_Latn | 684,880 | 0.1 | 1,223,481 | 0.23 |
Modern Standard Arabic (Romanized) | arb_Latn | 685,408 | 0.1 | 1,223,481 | 0.23 |
Scottish Gaelic | gla_Latn | 708,620 | 0.1 | 1,243,627 | 0.23 |
Sindhi | snd_Arab | 718,680 | 0.11 | 1,223,481 | 0.23 |
North Levantine Arabic | apc_Arab | 720,048 | 0.11 | 1,223,481 | 0.23 |
Tunisian Arabic | aeb_Arab | 720,360 | 0.11 | 1,223,481 | 0.23 |
South Levantine Arabic | ajp_Arab | 720,488 | 0.11 | 1,223,481 | 0.23 |
Dari | prs_Arab | 720,500 | 0.11 | 1,223,481 | 0.23 |
Moroccan Arabic | ary_Arab | 722,904 | 0.11 | 1,223,481 | 0.23 |
Egyptian Arabic | arz_Arab | 723,356 | 0.11 | 1,223,481 | 0.23 |
Najdi Arabic | ars_Arab | 725,784 | 0.11 | 1,223,481 | 0.23 |
Acehnese (Arabic script) | ace_Arab | 726,272 | 0.11 | 1,223,481 | 0.23 |
Mesopotamian Arabic | acm_Arab | 728,472 | 0.11 | 1,223,481 | 0.23 |
Ta’izzi-Adeni Arabic | acq_Arab | 734,780 | 0.11 | 1,223,481 | 0.23 |
South Azerbaijani | azb_Arab | 735,728 | 0.11 | 1,223,481 | 0.23 |
Central Kanuri (Arabic script) | knc_Arab | 746,936 | 0.11 | 1,223,481 | 0.23 |
Rundi | run_Latn | 749,792 | 0.11 | 1,296,111 | 0.24 |
Banjar (Arabic script) | bjn_Arab | 751,112 | 0.11 | 1,223,481 | 0.23 |
Central Kurdish | ckb_Arab | 756,804 | 0.11 | 1,223,481 | 0.23 |
Bashkir | bak_Cyrl | 758,816 | 0.11 | 1,223,481 | 0.23 |
Kashmiri (Arabic script) | kas_Arab | 759,140 | 0.11 | 1,223,481 | 0.23 |
Tatar | tat_Cyrl | 764,212 | 0.11 | 1,247,685 | 0.23 |
Minangkabau (Arabic script) | min_Arab | 765,384 | 0.11 | 1,223,481 | 0.23 |
Kazakh | kaz_Cyrl | 766,176 | 0.11 | 1,232,697 | 0.23 |
Halh Mongolian | khk_Cyrl | 776,384 | 0.11 | 1,224,353 | 0.23 |
Tajik | tgk_Cyrl | 780,452 | 0.11 | 1,223,481 | 0.23 |
Eastern Yiddish | ydd_Hebr | 781,452 | 0.12 | 1,223,481 | 0.23 |
Uyghur | uig_Arab | 785,444 | 0.12 | 1,256,999 | 0.24 |
Armenian | hye_Armn | 789,952 | 0.12 | 1,228,171 | 0.23 |
Hebrew | heb_Hebr | 793,144 | 0.12 | 1,604,365 | 0.3 |
Belarusian | bel_Cyrl | 806,588 | 0.12 | 1,261,197 | 0.24 |
Macedonian | mkd_Cyrl | 813,436 | 0.12 | 1,384,567 | 0.26 |
Welsh | cym_Latn | 821,036 | 0.12 | 1,321,455 | 0.25 |
Northern Uzbek | uzn_Latn | 835,560 | 0.12 | 1,273,404 | 0.24 |
Central Atlas Tamazight | tzm_Tfng | 843,508 | 0.12 | 1,223,481 | 0.23 |
Tamasheq (Tifinagh script) | taq_Tfng | 848,104 | 0.12 | 1,223,481 | 0.23 |
Magahi | mag_Deva | 851,360 | 0.13 | 1,223,481 | 0.23 |
Bhojpuri | bho_Deva | 854,848 | 0.13 | 1,223,481 | 0.23 |
Awadhi | awa_Deva | 857,096 | 0.13 | 1,224,037 | 0.23 |
Chhattisgarhi | hne_Deva | 859,332 | 0.13 | 1,223,481 | 0.23 |
Kyrgyz | kir_Cyrl | 860,700 | 0.13 | 1,250,163 | 0.23 |
Maithili | mai_Deva | 863,476 | 0.13 | 1,223,481 | 0.23 |
Assamese | asm_Beng | 865,904 | 0.13 | 1,223,481 | 0.23 |
Kashmiri (Devanagari script) | kas_Deva | 867,232 | 0.13 | 1,223,481 | 0.23 |
Sanskrit | san_Deva | 879,236 | 0.13 | 1,223,481 | 0.23 |
Lao | lao_Laoo | 888,240 | 0.13 | 1,223,481 | 0.23 |
Odia | ory_Orya | 890,508 | 0.13 | 1,223,481 | 0.23 |
Santali | sat_Olck | 902,300 | 0.13 | 1,223,481 | 0.23 |
Kannada | kan_Knda | 909,260 | 0.13 | 1,223,481 | 0.23 |
Meitei (Bengali script) | mni_Beng | 917,984 | 0.14 | 1,223,481 | 0.23 |
Georgian | kat_Geor | 928,712 | 0.14 | 1,226,729 | 0.23 |
Kamba | kam_Latn | 936,468 | 0.14 | 2,136,615 | 0.4 |
Tigrinya | tir_Ethi | 949,608 | 0.14 | 1,276,536 | 0.24 |
Swati | ssw_Latn | 950,564 | 0.14 | 2,195,002 | 0.41 |
Malayalam | mal_Mlym | 953,984 | 0.14 | 1,225,083 | 0.23 |
Nigerian Fulfulde | fuv_Latn | 956,328 | 0.14 | 2,126,652 | 0.4 |
Umbundu | umb_Latn | 974,104 | 0.14 | 2,264,553 | 0.43 |
Ganda | lug_Latn | 975,780 | 0.14 | 2,273,481 | 0.43 |
Northern Sotho | nso_Latn | 978,484 | 0.14 | 2,250,971 | 0.42 |
Khmer | khm_Khmr | 984,756 | 0.14 | 1,227,825 | 0.23 |
Luo | luo_Latn | 993,068 | 0.15 | 2,249,242 | 0.42 |
Standard Tibetan | bod_Tibt | 993,732 | 0.15 | 1,223,481 | 0.23 |
Tswana | tsn_Latn | 1,009,328 | 0.15 | 2,323,481 | 0.44 |
Kinyarwanda | kin_Latn | 1,010,752 | 0.15 | 2,273,481 | 0.43 |
Sinhala | sin_Sinh | 1,012,012 | 0.15 | 1,256,582 | 0.24 |
Xhosa | xho_Latn | 1,019,804 | 0.15 | 2,323,481 | 0.44 |
Shona | sna_Latn | 1,026,320 | 0.15 | 2,273,481 | 0.43 |
Esperanto | epo_Latn | 1,029,444 | 0.15 | 2,612,083 | 0.49 |
Tsonga | tso_Latn | 1,031,856 | 0.15 | 2,323,481 | 0.44 |
Dzongkha | dzo_Tibt | 1,033,552 | 0.15 | 1,223,481 | 0.23 |
Zulu | zul_Latn | 1,039,296 | 0.15 | 2,323,481 | 0.44 |
Serbian | srp_Cyrl | 1,040,024 | 0.15 | 1,362,598 | 0.26 |
Nyanja | nya_Latn | 1,061,780 | 0.16 | 2,323,481 | 0.44 |
Shan | shn_Mymr | 1,074,940 | 0.16 | 1,223,481 | 0.23 |
Igbo | ibo_Latn | 1,095,300 | 0.16 | 2,282,301 | 0.43 |
Hausa | hau_Latn | 1,112,272 | 0.16 | 2,335,738 | 0.44 |
West Central Oromo | gaz_Latn | 1,115,600 | 0.16 | 2,343,260 | 0.44 |
Nepali | npi_Deva | 1,144,676 | 0.17 | 1,281,430 | 0.24 |
Yoruba | yor_Latn | 1,164,540 | 0.17 | 2,334,801 | 0.44 |
Southern Pashto | pbt_Arab | 1,170,840 | 0.17 | 1,365,533 | 0.26 |
Somali | som_Latn | 1,198,320 | 0.18 | 2,482,437 | 0.47 |
Burmese | mya_Mymr | 1,228,196 | 0.18 | 1,279,882 | 0.24 |
Amharic | amh_Ethi | 1,261,128 | 0.19 | 1,980,215 | 0.37 |
Eastern Panjabi | pan_Guru | 1,305,636 | 0.19 | 1,307,897 | 0.25 |
Gujarati | guj_Gujr | 1,331,780 | 0.2 | 1,317,314 | 0.25 |
Marathi | mar_Deva | 1,494,024 | 0.22 | 1,443,950 | 0.27 |
Bengali | ben_Beng | 1,650,272 | 0.24 | 1,411,514 | 0.27 |
Chinese (Traditional) | zho_Hant | 1,778,736 | 0.26 | 1,956,189 | 0.37 |
Tamil | tam_Taml | 1,833,328 | 0.27 | 1,394,473 | 0.26 |
Swahili | swh_Latn | 1,970,784 | 0.29 | 4,185,608 | 0.79 |
Telugu | tel_Telu | 2,224,480 | 0.33 | 1,573,325 | 0.3 |
Ukrainian | ukr_Cyrl | 2,227,616 | 0.33 | 2,216,119 | 0.42 |
Western Persian | pes_Arab | 2,389,340 | 0.35 | 1,811,121 | 0.34 |
Turkish | tur_Latn | 3,106,600 | 0.46 | 4,146,153 | 0.78 |
Urdu | urd_Arab | 3,553,960 | 0.52 | 3,513,218 | 0.66 |
Korean | kor_Hang | 4,642,468 | 0.68 | 3,415,920 | 0.64 |
Python | python | 4,728,504 | 0.7 | 3,142,962 | 0.59 |
Japanese | jpn_Jpan | 5,079,788 | 0.75 | 4,193,570 | 0.79 |
Thai | tha_Thai | 6,860,704 | 1.01 | 4,666,299 | 0.88 |
Chinese (Simplified) | zho_Hans | 8,063,684 | 1.19 | 7,355,509 | 1.38 |
Vietnamese | vie_Latn | 8,398,824 | 1.24 | 6,194,925 | 1.16 |
Indonesian | ind_Latn | 9,380,144 | 1.38 | 5,301,812 | 1.0 |
Hindi | hin_Deva | 9,914,328 | 1.46 | 5,612,176 | 1.05 |
Croatian | hrv_Latn | 10,028,028 | 1.48 | 5,583,975 | 1.05 |
Modern Standard Arabic | arb_Arab | 11,051,064 | 1.63 | 7,232,551 | 1.36 |
Romanian | ron_Latn | 11,441,636 | 1.68 | 5,594,927 | 1.05 |
Maltese | mlt_Latn | 11,614,488 | 1.71 | 5,513,885 | 1.04 |
Slovenian | slv_Latn | 12,014,912 | 1.77 | 5,533,689 | 1.04 |
Estonian | est_Latn | 12,126,212 | 1.79 | 5,584,057 | 1.05 |
Lithuanian | lit_Latn | 12,253,976 | 1.8 | 5,603,047 | 1.05 |
Slovak | slk_Latn | 12,286,300 | 1.81 | 5,513,481 | 1.04 |
Standard Latvian | lvs_Latn | 12,298,584 | 1.81 | 5,517,287 | 1.04 |
Polish | pol_Latn | 12,409,684 | 1.83 | 5,868,631 | 1.1 |
Hungarian | hun_Latn | 12,607,420 | 1.86 | 6,086,621 | 1.14 |
Russian | rus_Cyrl | 13,110,908 | 1.93 | 8,798,927 | 1.65 |
Czech | ces_Latn | 14,316,052 | 2.11 | 6,418,462 | 1.21 |
Bulgarian | bul_Cyrl | 14,615,468 | 2.15 | 7,265,885 | 1.37 |
Swedish | swe_Latn | 14,646,656 | 2.16 | 5,634,363 | 1.06 |
Finnish | fin_Latn | 15,011,464 | 2.21 | 6,077,501 | 1.14 |
Danish | dan_Latn | 16,136,612 | 2.38 | 5,831,109 | 1.1 |
Dutch | nld_Latn | 22,387,020 | 3.3 | 8,992,864 | 1.69 |
Greek | ell_Grek | 23,144,296 | 3.41 | 7,224,001 | 1.36 |
Italian | ita_Latn | 23,952,824 | 3.53 | 9,967,738 | 1.87 |
Portuguese | por_Latn | 27,297,252 | 4.02 | 11,242,808 | 2.11 |
German | deu_Latn | 27,909,808 | 4.11 | 15,806,969 | 2.97 |
French | fra_Latn | 28,428,608 | 4.18 | 16,365,984 | 3.08 |
Spanish | spa_Latn | 30,969,580 | 4.56 | 16,315,928 | 3.07 |
English | eng_Latn | 69,530,384 | 10.24 | 53,015,690 | 9.96 |
Total | - | 679,318,704 | 100 | 532,107,156 | 100 |
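The percentage columns appear to be each row's share of the Total row, rounded to two decimals. This is an inference from the numbers rather than something the card states. The short check below reproduces the English row under that assumption; `share_percent` is a hypothetical helper name.

```python
def share_percent(part: int, total: int) -> float:
    """Return part as a percentage of total, rounded to two decimals."""
    return round(100 * part / total, 2)


# Values copied from the Total and English rows of the table.
TOTAL_KILOBYTES = 679_318_704
TOTAL_SAMPLES = 532_107_156

english_kb_share = share_percent(69_530_384, TOTAL_KILOBYTES)
english_sample_share = share_percent(53_015_690, TOTAL_SAMPLES)
print(english_kb_share, english_sample_share)  # 10.24 9.96
```

The same calculation can be used to turn the table into per-language sampling weights, for instance when rebalancing toward low-resource languages.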
The dataset collection is released under Apache 2.0. Note that individual datasets may have different licenses.
    @article{muennighoff2022crosslingual,
      title={Crosslingual generalization through multitask finetuning},
      author={Muennighoff, Niklas and Wang, Thomas and Sutawika, Lintang and Roberts, Adam and Biderman, Stella and Scao, Teven Le and Bari, M Saiful and Shen, Sheng and Yong, Zheng-Xin and Schoelkopf, Hailey and others},
      journal={arXiv preprint arXiv:2211.01786},
      year={2022}
    }
Thanks to the contributors of promptsource for adding many of the prompts used in this dataset. Thanks to the Aya team @ C4AI.