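Dataset:

Muennighoff/xP3x

English

Dataset Card for xP3x

Dataset Summary

xP3x (Crosslingual Public Pool of Prompts eXtended) is a collection of prompts and datasets covering 277 languages and 16 NLP tasks. It contains all of xP3 plus much more. It is used to train future contenders to mT0 and BLOOMZ at Project Aya @ C4AI.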

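  • Creation: The dataset can be recreated using the provided instructions together with the file xp3x_create.py in this repository. We provide this version to save processing time.
  • Languages: 277
  • xP3 Dataset Family: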
Name | Explanation | Example models
xP3x | Mixture of 17 tasks in 277 languages with English prompts | WIP - Join us at Project Aya @ C4AI to help!
xP3 | Mixture of 13 training tasks in 46 languages with English prompts | bloomz & mt0-xxl
xP3mt | Mixture of 13 training tasks in 46 languages with prompts in 20 languages (machine-translated from English) | bloomz-mt & mt0-xxl-mt
xP3all | xP3 + evaluation datasets adding an additional 3 tasks for a total of 16 tasks in 46 languages with English prompts | -
xP3megds | Megatron-DeepSpeed processed version of xP3 | bloomz
P3 | Repreprocessed version of the English-only P3 with 8 training tasks | bloomz-p3 & mt0-xxl-p3

Dataset Structure

Data Instances

An example looks as follows:

{
  'inputs': '11月、遂にクロームはファイヤーフォックスを引き離し始めた。_はインターネットユーザーの評価が高まったのだ。\nReplace the _ in the above sentence with the correct option: \n- ファイヤーフォックス\n- クローム',
  'targets': 'クローム',
  'language': 'jpn_Jpan',
  'split': 'test',
  'template': 'Replace',
  'dataset': 'Muennighoff/xwinograd',
  'config': 'jp'
}

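Data Fields

The data fields are the same among all splits:

  • inputs: the natural language input fed to the model
  • targets: the natural language target that the model has to generate
  • language: the language code. The codes are an extension of the FLORES-200 codes, where the first part is the language code and the second part is the script code.
  • template: the name of the prompt used.
  • dataset: the Hugging Face dataset identifier of where the data stems from.
  • config: the config of the Hugging Face dataset.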
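
As a minimal sketch of how the two-part `language` code described above could be taken apart, the helper below splits a code such as `jpn_Jpan` into its language and script parts. The function name `split_language_code` and the regular expression are assumptions made for illustration; they are not part of the dataset tooling. Entries for programming languages (for example `python` or `jupyter_notebook` in the Data Splits table) have no script part, so the helper returns `None` for it.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: a 3-4 letter lowercase language code, an underscore,
# and a 4-letter title-case script code (e.g. "jpn_Jpan", "toki_Latn").
_LANG_SCRIPT = re.compile(r"^([a-z]{3,4})_([A-Z][a-z]{3})$")


def split_language_code(code: str) -> Tuple[str, Optional[str]]:
    """Split 'jpn_Jpan' into ('jpn', 'Jpan'); return (code, None) otherwise."""
    match = _LANG_SCRIPT.match(code)
    if match is None:
        # Codes such as "python" or "jupyter_notebook" carry no script part.
        return code, None
    return match.group(1), match.group(2)


print(split_language_code("jpn_Jpan"))
print(split_language_code("jupyter_notebook"))
```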

Usage

The dataset is 680 GB in size and contains 530 million samples. Filter and deduplicate it as needed.

Loading by language:

# pip install -q datasets
from datasets import load_dataset
ds = load_dataset("Muennighoff/xP3x", "zho_Hans", streaming=True) # Use streaming to not download all at once
for x in ds["train"]:
    print(x)
    break

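You can then filter on the data fields, for example to keep only a particular config or dataset. Since every dataset-config-template combination is stored as its own jsonl file, you can also decide which datasets, configs and templates you want and download only those. For example, to download all Japanese xwinograd samples, you could do: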

# pip install -q datasets
# pip install --upgrade huggingface-hub
from datasets import load_dataset
from huggingface_hub import HfFileSystem, hf_hub_url

# List every Japanese (jpn_Jpan) file whose name contains "xwinograd"
fs = HfFileSystem()
fps = fs.glob("datasets/Muennighoff/xP3x/data/jpn_Jpan/*xwinograd*")
# Turn each repository path into a downloadable URL
resolved_paths = [fs.resolve_path(file) for file in fps]
data_files = [hf_hub_url(resolved_path.repo_id, resolved_path.path_in_repo, repo_type=resolved_path.repo_type) for resolved_path in resolved_paths]

# Load only those jsonl files, using 8 worker processes
ds = load_dataset("json", data_files=data_files, num_proc=8)["train"]
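
Filtering on the data fields after loading can be sketched as follows. This is only an illustration: `filter_records` is a hypothetical helper, and the in-memory `sample` list (including the made-up identifier `other/dataset`) stands in for a streamed split such as `ds["train"]`, so the snippet runs without downloading anything.

```python
from typing import Dict, Iterable, Iterator, Optional


def filter_records(
    records: Iterable[Dict[str, str]],
    dataset: Optional[str] = None,
    config: Optional[str] = None,
    template: Optional[str] = None,
) -> Iterator[Dict[str, str]]:
    """Yield only the records that match every criterion that is set."""
    wanted = {"dataset": dataset, "config": config, "template": template}
    for record in records:
        if all(v is None or record.get(k) == v for k, v in wanted.items()):
            yield record


# Tiny stand-in for a streamed split; the second record is invented.
sample = [
    {"inputs": "a", "targets": "b", "dataset": "Muennighoff/xwinograd",
     "config": "jp", "template": "Replace"},
    {"inputs": "c", "targets": "d", "dataset": "other/dataset",
     "config": "x", "template": "Replace"},
]

kept = list(filter_records(sample, dataset="Muennighoff/xwinograd"))
print(len(kept))
```

Because the helper is a generator, it can wrap a streamed split directly and never needs the full dataset in memory.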

Data Splits

Language Code Kilobytes % Samples %
Emilian egl_Latn 104 0.0 402 0.0
Swiss German gsw_Latn 104 0.0 408 0.0
Novial nov_Latn 116 0.0 432 0.0
Ainu (Latin script) ain_Latn 120 0.0 410 0.0
Chamorro cha_Latn 120 0.0 452 0.0
Gothic got_Goth 120 0.0 402 0.0
Prussian prg_Latn 120 0.0 424 0.0
Picard pcd_Latn 140 0.0 530 0.0
Northern Frisian frr_Latn 156 0.0 554 0.0
Uzbek (Latin script) uzb_Latn 156 0.0 600 0.0
Ottoman Turkish (Latin script) ota_Latn 188 0.0 632 0.0
Swahili (macrolanguage) swa_Latn 212 0.0 772 0.0
Talossan tzl_Latn 220 0.0 836 0.0
Kven Finnish fkv_Latn 260 0.0 910 0.0
Zaza zza_Latn 260 0.0 1,056 0.0
Frisian fry_Latn 268 0.0 956 0.0
Piemontese pms_Latn 276 0.0 998 0.0
Kalmyk xal_Cyrl 288 0.0 976 0.0
Hunsrik hrx_Latn 352 0.0 1,380 0.0
Romany rom_Latn 364 0.0 1,410 0.0
Ancient Greek (to 1453) grc_Grek 392 0.0 1,226 0.0
Tase Naga nst_Latn 424 0.0 1,608 0.0
Albanian sqi_Latn 596 0.0 2,216 0.0
Guadeloupean Creole French gcf_Latn 608 0.0 2,326 0.0
Yakut sah_Cyrl 608 0.0 1,986 0.0
Ho (Latin script) hoc_Latn 632 0.0 2,634 0.0
Khasi kha_Latn 676 0.0 2,664 0.0
Algerian Arabic arq_Arab 688 0.0 2,278 0.0
Lower Sorbian dsb_Latn 692 0.0 2,596 0.0
Chuvash chv_Cyrl 716 0.0 2,446 0.0
Old Russian orv_Cyrl 752 0.0 2,586 0.0
Pampanga pam_Latn 784 0.0 2,984 0.0
Kurdish (Latin script) kur_Latn 796 0.0 3,050 0.0
Ottoman Turkish ota_Arab 832 0.0 2,772 0.0
Kotava avk_Latn 864 0.0 3,118 0.0
Upper Sorbian hsb_Latn 900 0.0 3,474 0.0
Buryat bua_Cyrl 924 0.0 3,218 0.0
Swabian swg_Latn 996 0.0 3,366 0.0
Coastal Kadazan kzj_Latn 1,136 0.0 3,766 0.0
Chavacano cbk_Latn 1,352 0.0 4,994 0.0
Quechua que_Latn 1,704 0.0 5,312 0.0
Lingua Franca Nova (Cyrillic script) lfn_Cyrl 1,740 0.0 5,458 0.0
Gronings gos_Latn 1,864 0.0 7,462 0.0
Volapük vol_Latn 1,948 0.0 7,712 0.0
Yue Chinese (Simplified) yue_Hans 2,300 0.0 7,872 0.0
Mari (Russia) chm_Cyrl 2,540 0.0 7,496 0.0
Kadazan Dusun dtp_Latn 2,548 0.0 8,892 0.0
Breton bre_Latn 3,048 0.0 11,868 0.0
Ladino lad_Latn 3,224 0.0 11,916 0.0
Cornish cor_Latn 3,492 0.0 13,880 0.0
Interlingue ile_Latn 3,700 0.0 14,468 0.0
Wu Chinese wuu_Hans 3,784 0.0 13,062 0.0
Japanese (Katakana) jpn_Kana 4,208 0.0 13,942 0.0
Ido ido_Latn 6,180 0.0 23,742 0.0
Yiddish yid_Hebr 9,896 0.0 34,412 0.01
Klingon tlh_Latn 11,716 0.0 46,010 0.01
Lingua Franca Nova lfn_Latn 13,328 0.0 46,826 0.01
Lojban jbo_Latn 17,468 0.0 66,694 0.01
Low German nds_Latn 18,364 0.0 68,098 0.01
Interlingua (International Auxiliary Language Association) ina_Latn 25,700 0.0 76,584 0.01
Java java 25,904 0.0 13,551 0.0
Japanese (Kanji) jpn_Hani 26,292 0.0 89,978 0.02
Norwegian nor_Latn 26,724 0.0 93,116 0.02
Toki Pona toki_Latn 26,808 0.0 97,170 0.02
Latin lat_Latn 28,900 0.0 101,390 0.02
Serbo-Croatian hbs_Latn 29,452 0.0 105,748 0.02
Nigerian Pidgin pcm_Latn 145,872 0.02 88,992 0.02
Azerbaijani (South or North; Latin script) aze_Latn 147,564 0.02 77,875 0.01
Serbian (Latin script) srp_Latn 179,072 0.03 131,101 0.02
Japanese (Hiragana) jpn_Hira 188,944 0.03 628,758 0.12
Berber (Latin script) ber_Latn 201,464 0.03 693,602 0.13
Jupyter Notebook jupyter_notebook 416,056 0.06 400,000 0.08
Yue Chinese yue_Hant 613,352 0.09 1,227,429 0.23
Haitian Creole hat_Latn 629,420 0.09 1,228,281 0.23
Mossi mos_Latn 630,416 0.09 1,223,481 0.23
Pangasinan pag_Latn 630,684 0.09 1,223,481 0.23
Twi twi_Latn 631,172 0.09 1,223,481 0.23
Bosnian bos_Latn 633,016 0.09 1,224,479 0.23
Ewe ewe_Latn 633,292 0.09 1,223,481 0.23
Bambara bam_Latn 634,520 0.09 1,223,481 0.23
Javanese jav_Latn 635,248 0.09 1,224,003 0.23
Southwestern Dinka dik_Latn 635,416 0.09 1,223,481 0.23
Kabuverdianu kea_Latn 636,144 0.09 1,223,481 0.23
Dyula dyu_Latn 636,464 0.09 1,223,481 0.23
Venetian vec_Latn 637,412 0.09 1,223,481 0.23
Chokwe cjk_Latn 637,532 0.09 1,223,481 0.23
Latgalian ltg_Latn 637,612 0.09 1,223,481 0.23
Sundanese sun_Latn 638,120 0.09 1,223,481 0.23
Asturian ast_Latn 638,708 0.09 1,223,481 0.23
Akan aka_Latn 639,648 0.09 1,223,481 0.23
Mizo lus_Latn 639,680 0.09 1,223,481 0.23
Guarani grn_Latn 641,540 0.09 1,225,647 0.23
Limburgish lim_Latn 642,368 0.09 1,223,481 0.23
Faroese fao_Latn 642,432 0.09 1,224,067 0.23
Buginese bug_Latn 643,472 0.09 1,223,481 0.23
Sango sag_Latn 643,596 0.09 1,223,481 0.23
Luba-Kasai lua_Latn 643,640 0.09 1,223,481 0.23
Papiamento pap_Latn 643,648 0.09 1,223,481 0.23
Silesian szl_Latn 644,608 0.09 1,223,481 0.23
Sicilian scn_Latn 645,636 0.1 1,223,481 0.23
Kimbundu kmb_Latn 645,964 0.1 1,223,481 0.23
Basque eus_Latn 646,084 0.1 1,246,877 0.23
Balinese ban_Latn 646,408 0.1 1,223,481 0.23
Norwegian Nynorsk nno_Latn 646,996 0.1 1,229,699 0.23
Central Aymara ayr_Latn 647,236 0.1 1,223,481 0.23
Tamasheq (Latin script) taq_Latn 648,656 0.1 1,223,481 0.23
Kikongo kon_Latn 648,992 0.1 1,223,481 0.23
Friulian fur_Latn 649,272 0.1 1,223,481 0.23
Ayacucho Quechua quy_Latn 649,992 0.1 1,223,481 0.23
Maori mri_Latn 650,336 0.1 1,224,211 0.23
Icelandic isl_Latn 650,372 0.1 1,246,623 0.23
Galician glg_Latn 652,088 0.1 1,233,291 0.23
Catalan cat_Latn 652,116 0.1 1,241,381 0.23
Lombard lmo_Latn 652,120 0.1 1,223,481 0.23
Banjar (Latin script) bjn_Latn 652,372 0.1 1,223,481 0.23
Fijian fij_Latn 652,796 0.1 1,223,481 0.23
Crimean Tatar crh_Latn 653,920 0.1 1,223,895 0.23
Northern Kurdish kmr_Latn 654,108 0.1 1,223,481 0.23
Ligurian lij_Latn 654,432 0.1 1,223,481 0.23
Occitan oci_Latn 655,676 0.1 1,227,945 0.23
Turkmen tuk_Latn 658,672 0.1 1,241,205 0.23
Luxembourgish ltz_Latn 658,768 0.1 1,225,339 0.23
Cebuano ceb_Latn 659,124 0.1 1,226,039 0.23
Samoan smo_Latn 659,704 0.1 1,223,481 0.23
Sardinian srd_Latn 660,000 0.1 1,223,481 0.23
Bemba bem_Latn 660,504 0.1 1,223,481 0.23
Minangkabau (Latin script) min_Latn 660,672 0.1 1,223,481 0.23
Acehnese (Latin script) ace_Latn 661,084 0.1 1,223,481 0.23
Ilocano ilo_Latn 661,184 0.1 1,227,663 0.23
Irish gle_Latn 661,660 0.1 1,227,357 0.23
Fon fon_Latn 663,124 0.1 1,223,481 0.23
Waray war_Latn 664,120 0.1 1,226,503 0.23
Norwegian Bokmål nob_Latn 666,240 0.1 1,300,607 0.24
Tosk Albanian als_Latn 666,692 0.1 1,223,481 0.23
Standard Malay zsm_Latn 667,088 0.1 1,270,715 0.24
Southern Sotho sot_Latn 667,728 0.1 1,223,481 0.23
Kabyle kab_Latn 668,128 0.1 1,346,605 0.25
Jingpho kac_Latn 669,464 0.1 1,223,481 0.23
Lingala lin_Latn 670,428 0.1 1,323,481 0.25
Wolof wol_Latn 670,568 0.1 1,373,481 0.26
Central Kanuri (Latin script) knc_Latn 670,800 0.1 1,223,481 0.23
Kikuyu kik_Latn 672,096 0.1 1,223,481 0.23
Tok Pisin tpi_Latn 672,916 0.1 1,223,481 0.23
Nuer nus_Latn 673,632 0.1 1,223,481 0.23
Tagalog tgl_Latn 673,684 0.1 1,247,417 0.23
Tumbuka tum_Latn 676,948 0.1 1,223,481 0.23
Plateau Malagasy plt_Latn 677,852 0.1 1,223,481 0.23
Afrikaans afr_Latn 679,164 0.1 1,337,091 0.25
North Azerbaijani azj_Latn 679,820 0.1 1,223,481 0.23
Kabiyè kbp_Latn 684,880 0.1 1,223,481 0.23
Modern Standard Arabic (Romanized) arb_Latn 685,408 0.1 1,223,481 0.23
Scottish Gaelic gla_Latn 708,620 0.1 1,243,627 0.23
Sindhi snd_Arab 718,680 0.11 1,223,481 0.23
North Levantine Arabic apc_Arab 720,048 0.11 1,223,481 0.23
Tunisian Arabic aeb_Arab 720,360 0.11 1,223,481 0.23
South Levantine Arabic ajp_Arab 720,488 0.11 1,223,481 0.23
Dari prs_Arab 720,500 0.11 1,223,481 0.23
Moroccan Arabic ary_Arab 722,904 0.11 1,223,481 0.23
Egyptian Arabic arz_Arab 723,356 0.11 1,223,481 0.23
Najdi Arabic ars_Arab 725,784 0.11 1,223,481 0.23
Acehnese (Arabic script) ace_Arab 726,272 0.11 1,223,481 0.23
Mesopotamian Arabic acm_Arab 728,472 0.11 1,223,481 0.23
Ta’izzi-Adeni Arabic acq_Arab 734,780 0.11 1,223,481 0.23
South Azerbaijani azb_Arab 735,728 0.11 1,223,481 0.23
Central Kanuri (Arabic script) knc_Arab 746,936 0.11 1,223,481 0.23
Rundi run_Latn 749,792 0.11 1,296,111 0.24
Banjar (Arabic script) bjn_Arab 751,112 0.11 1,223,481 0.23
Central Kurdish ckb_Arab 756,804 0.11 1,223,481 0.23
Bashkir bak_Cyrl 758,816 0.11 1,223,481 0.23
Kashmiri (Arabic script) kas_Arab 759,140 0.11 1,223,481 0.23
Tatar tat_Cyrl 764,212 0.11 1,247,685 0.23
Minangkabau (Arabic script) min_Arab 765,384 0.11 1,223,481 0.23
Kazakh kaz_Cyrl 766,176 0.11 1,232,697 0.23
Halh Mongolian khk_Cyrl 776,384 0.11 1,224,353 0.23
Tajik tgk_Cyrl 780,452 0.11 1,223,481 0.23
Eastern Yiddish ydd_Hebr 781,452 0.12 1,223,481 0.23
Uyghur uig_Arab 785,444 0.12 1,256,999 0.24
Armenian hye_Armn 789,952 0.12 1,228,171 0.23
Hebrew heb_Hebr 793,144 0.12 1,604,365 0.3
Belarusian bel_Cyrl 806,588 0.12 1,261,197 0.24
Macedonian mkd_Cyrl 813,436 0.12 1,384,567 0.26
Welsh cym_Latn 821,036 0.12 1,321,455 0.25
Northern Uzbek uzn_Latn 835,560 0.12 1,273,404 0.24
Central Atlas Tamazight tzm_Tfng 843,508 0.12 1,223,481 0.23
Tamasheq (Tifinagh script) taq_Tfng 848,104 0.12 1,223,481 0.23
Magahi mag_Deva 851,360 0.13 1,223,481 0.23
Bhojpuri bho_Deva 854,848 0.13 1,223,481 0.23
Awadhi awa_Deva 857,096 0.13 1,224,037 0.23
Chhattisgarhi hne_Deva 859,332 0.13 1,223,481 0.23
Kyrgyz kir_Cyrl 860,700 0.13 1,250,163 0.23
Maithili mai_Deva 863,476 0.13 1,223,481 0.23
Assamese asm_Beng 865,904 0.13 1,223,481 0.23
Kashmiri (Devanagari script) kas_Deva 867,232 0.13 1,223,481 0.23
Sanskrit san_Deva 879,236 0.13 1,223,481 0.23
Lao lao_Laoo 888,240 0.13 1,223,481 0.23
Odia ory_Orya 890,508 0.13 1,223,481 0.23
Santali sat_Olck 902,300 0.13 1,223,481 0.23
Kannada kan_Knda 909,260 0.13 1,223,481 0.23
Meitei (Bengali script) mni_Beng 917,984 0.14 1,223,481 0.23
Georgian kat_Geor 928,712 0.14 1,226,729 0.23
Kamba kam_Latn 936,468 0.14 2,136,615 0.4
Tigrinya tir_Ethi 949,608 0.14 1,276,536 0.24
Swati ssw_Latn 950,564 0.14 2,195,002 0.41
Malayalam mal_Mlym 953,984 0.14 1,225,083 0.23
Nigerian Fulfulde fuv_Latn 956,328 0.14 2,126,652 0.4
Umbundu umb_Latn 974,104 0.14 2,264,553 0.43
Ganda lug_Latn 975,780 0.14 2,273,481 0.43
Northern Sotho nso_Latn 978,484 0.14 2,250,971 0.42
Khmer khm_Khmr 984,756 0.14 1,227,825 0.23
Luo luo_Latn 993,068 0.15 2,249,242 0.42
Standard Tibetan bod_Tibt 993,732 0.15 1,223,481 0.23
Tswana tsn_Latn 1,009,328 0.15 2,323,481 0.44
Kinyarwanda kin_Latn 1,010,752 0.15 2,273,481 0.43
Sinhala sin_Sinh 1,012,012 0.15 1,256,582 0.24
Xhosa xho_Latn 1,019,804 0.15 2,323,481 0.44
Shona sna_Latn 1,026,320 0.15 2,273,481 0.43
Esperanto epo_Latn 1,029,444 0.15 2,612,083 0.49
Tsonga tso_Latn 1,031,856 0.15 2,323,481 0.44
Dzongkha dzo_Tibt 1,033,552 0.15 1,223,481 0.23
Zulu zul_Latn 1,039,296 0.15 2,323,481 0.44
Serbian srp_Cyrl 1,040,024 0.15 1,362,598 0.26
Nyanja nya_Latn 1,061,780 0.16 2,323,481 0.44
Shan shn_Mymr 1,074,940 0.16 1,223,481 0.23
Igbo ibo_Latn 1,095,300 0.16 2,282,301 0.43
Hausa hau_Latn 1,112,272 0.16 2,335,738 0.44
West Central Oromo gaz_Latn 1,115,600 0.16 2,343,260 0.44
Nepali npi_Deva 1,144,676 0.17 1,281,430 0.24
Yoruba yor_Latn 1,164,540 0.17 2,334,801 0.44
Southern Pashto pbt_Arab 1,170,840 0.17 1,365,533 0.26
Somali som_Latn 1,198,320 0.18 2,482,437 0.47
Burmese mya_Mymr 1,228,196 0.18 1,279,882 0.24
Amharic amh_Ethi 1,261,128 0.19 1,980,215 0.37
Eastern Panjabi pan_Guru 1,305,636 0.19 1,307,897 0.25
Gujarati guj_Gujr 1,331,780 0.2 1,317,314 0.25
Marathi mar_Deva 1,494,024 0.22 1,443,950 0.27
Bengali ben_Beng 1,650,272 0.24 1,411,514 0.27
Chinese (Traditional) zho_Hant 1,778,736 0.26 1,956,189 0.37
Tamil tam_Taml 1,833,328 0.27 1,394,473 0.26
Swahili swh_Latn 1,970,784 0.29 4,185,608 0.79
Telugu tel_Telu 2,224,480 0.33 1,573,325 0.3
Ukrainian ukr_Cyrl 2,227,616 0.33 2,216,119 0.42
Western Persian pes_Arab 2,389,340 0.35 1,811,121 0.34
Turkish tur_Latn 3,106,600 0.46 4,146,153 0.78
Urdu urd_Arab 3,553,960 0.52 3,513,218 0.66
Korean kor_Hang 4,642,468 0.68 3,415,920 0.64
Python python 4,728,504 0.7 3,142,962 0.59
Japanese jpn_Jpan 5,079,788 0.75 4,193,570 0.79
Thai tha_Thai 6,860,704 1.01 4,666,299 0.88
Chinese (Simplified) zho_Hans 8,063,684 1.19 7,355,509 1.38
Vietnamese vie_Latn 8,398,824 1.24 6,194,925 1.16
Indonesian ind_Latn 9,380,144 1.38 5,301,812 1.0
Hindi hin_Deva 9,914,328 1.46 5,612,176 1.05
Croatian hrv_Latn 10,028,028 1.48 5,583,975 1.05
Modern Standard Arabic arb_Arab 11,051,064 1.63 7,232,551 1.36
Romanian ron_Latn 11,441,636 1.68 5,594,927 1.05
Maltese mlt_Latn 11,614,488 1.71 5,513,885 1.04
Slovenian slv_Latn 12,014,912 1.77 5,533,689 1.04
Estonian est_Latn 12,126,212 1.79 5,584,057 1.05
Lithuanian lit_Latn 12,253,976 1.8 5,603,047 1.05
Slovak slk_Latn 12,286,300 1.81 5,513,481 1.04
Standard Latvian lvs_Latn 12,298,584 1.81 5,517,287 1.04
Polish pol_Latn 12,409,684 1.83 5,868,631 1.1
Hungarian hun_Latn 12,607,420 1.86 6,086,621 1.14
Russian rus_Cyrl 13,110,908 1.93 8,798,927 1.65
Czech ces_Latn 14,316,052 2.11 6,418,462 1.21
Bulgarian bul_Cyrl 14,615,468 2.15 7,265,885 1.37
Swedish swe_Latn 14,646,656 2.16 5,634,363 1.06
Finnish fin_Latn 15,011,464 2.21 6,077,501 1.14
Danish dan_Latn 16,136,612 2.38 5,831,109 1.1
Dutch nld_Latn 22,387,020 3.3 8,992,864 1.69
Greek ell_Grek 23,144,296 3.41 7,224,001 1.36
Italian ita_Latn 23,952,824 3.53 9,967,738 1.87
Portuguese por_Latn 27,297,252 4.02 11,242,808 2.11
German deu_Latn 27,909,808 4.11 15,806,969 2.97
French fra_Latn 28,428,608 4.18 16,365,984 3.08
Spanish spa_Latn 30,969,580 4.56 16,315,928 3.07
English eng_Latn 69,530,384 10.24 53,015,690 9.96
Total - 679,318,704 100 532,107,156 100
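
The two percentage columns appear to be each language's share of the total kilobytes and of the total samples, rounded to two decimals. The sketch below recomputes the English row from the totals in the table as a sanity check. The helper name `share_percent` is hypothetical, and the conversion to gigabytes assumes 1 KB = 1,000 bytes, which the table does not state.

```python
def share_percent(part: int, total: int) -> float:
    """Return part/total as a percentage rounded to two decimals."""
    return round(100 * part / total, 2)


# Totals and English row copied from the table above.
TOTAL_KB = 679_318_704
TOTAL_SAMPLES = 532_107_156

english_size_share = share_percent(69_530_384, TOTAL_KB)
english_sample_share = share_percent(53_015_690, TOTAL_SAMPLES)

# Assumption: 1 KB = 1,000 bytes, so 1 GB = 1,000,000 KB.
total_gb = TOTAL_KB / 1_000_000

print(english_size_share, english_sample_share, round(total_gb, 1))
```

Under that assumption the total comes to roughly 679 GB, in line with the 680 GB quoted in the Usage section.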
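Language specifics
  • Japanese: Every sample in jpn_Hira, jpn_Kana and jpn_Hani is guaranteed to contain Hiragana, Katakana or Kanji respectively. However, samples may still include the other scripts. So while all samples in jpn_Kana are guaranteed to contain Katakana, they may also contain Hiragana or Kanji.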
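
To check which Japanese scripts a given sample actually contains, one option is to test each character against Unicode block ranges. This is a rough sketch and not the procedure used to build the dataset: the function name `japanese_scripts` is hypothetical, and only the basic Hiragana, Katakana and CJK Unified Ideographs blocks are considered.

```python
from typing import Set

# Inclusive Unicode block ranges used as a rough proxy for each script.
_RANGES = {
    "Hira": (0x3040, 0x309F),  # Hiragana block
    "Kana": (0x30A0, 0x30FF),  # Katakana block
    "Hani": (0x4E00, 0x9FFF),  # CJK Unified Ideographs (basic block only)
}


def japanese_scripts(text: str) -> Set[str]:
    """Return the script labels whose block contains a character of text."""
    found: Set[str] = set()
    for char in text:
        code_point = ord(char)
        for name, (low, high) in _RANGES.items():
            if low <= code_point <= high:
                found.add(name)
    return found


print(sorted(japanese_scripts("遂にクローム")))
```

A sample from jpn_Kana would then be expected to include "Kana" in the result, possibly alongside the other labels.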

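Dataset Creation

Source Data

Training datasets

Dataset specifics
  • Flores-200: There are three prompts for Flores: continuation, question and command. They represent three commonly used prompting styles: making the prompt look like a natural continuation, turning it into a question, or commanding the model to do something.
  • tatoeba_mt: Contains duplicates. For example, it has data that is classified as both jpn_Kana and jpn_Jpan, so you may want to deduplicate.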
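
One simple way to remove such duplicates is to keep only the first record seen for each (inputs, targets) pair. The sketch below assumes that two records count as duplicates when both fields match exactly, regardless of their `language` label. That criterion and the helper name `deduplicate` are illustrative choices, not something the dataset prescribes, and the sample records are invented.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple


def deduplicate(records: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield each record whose (inputs, targets) pair has not been seen yet."""
    seen: Set[Tuple[str, str]] = set()
    for record in records:
        key = (record["inputs"], record["targets"])
        if key in seen:
            continue  # same text already emitted under another label
        seen.add(key)
        yield record


# Invented records: the first two differ only in their language label.
sample = [
    {"inputs": "Translate: A", "targets": "B", "language": "jpn_Kana"},
    {"inputs": "Translate: A", "targets": "B", "language": "jpn_Jpan"},
    {"inputs": "Translate: C", "targets": "D", "language": "jpn_Jpan"},
]

unique = list(deduplicate(sample))
print(len(unique))
```

The set of seen keys grows with the number of unique samples, so for large subsets it may be preferable to store a hash of each key instead of the full strings.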

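Additional Information

Licensing Information

The dataset collection is released under Apache 2.0. Note that individual datasets may have different licenses.

Citation Information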

@article{muennighoff2022crosslingual,
  title={Crosslingual generalization through multitask finetuning},
  author={Muennighoff, Niklas and Wang, Thomas and Sutawika, Lintang and Roberts, Adam and Biderman, Stella and Scao, Teven Le and Bari, M Saiful and Shen, Sheng and Yong, Zheng-Xin and Schoelkopf, Hailey and others},
  journal={arXiv preprint arXiv:2211.01786},
  year={2022}
}

Contributions

Thanks to the contributors of promptsource for adding many of the prompts used in this dataset. Thanks to the Aya team @ C4AI.