Dataset:
masakhane/afriqa
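AfriQA is the first cross-lingual question answering (QA) dataset with a focus on African languages. It contains over 12,000 XOR QA examples across 10 African languages, making it a valuable resource for developing more equitable QA technology.
Train/validation/test splits are provided for all 10 languages.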
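Ten languages are available: Bemba, Fon, Hausa, Igbo, Kinyarwanda, Swahili, Twi, Wolof, Yoruba, and Zulu.
An example data instance: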
{ "id": 0, "question": "Bushe icaalo ca Egypt caali tekwapo ne caalo cimbi?", "translated_question": "Has the country of Egypt been colonized before?", "answers": "['Emukwai']", "lang": "bem", "split": "dev", "translated_answer": "['yes']", "translation_type": "human_translation" }
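In the instance above, `answers` and `translated_answer` are strings that hold a Python-style list literal rather than real JSON arrays. A minimal sketch of how they might be decoded, assuming every record follows the same shape as this one example (the helper names `parse_string_list` and `decode_instance` are illustrative and not part of the dataset):

```python
import ast
import json

# The example instance from the card, kept verbatim as a JSON string.
RAW_INSTANCE = """{ "id": 0, "question": "Bushe icaalo ca Egypt caali tekwapo ne caalo cimbi?", "translated_question": "Has the country of Egypt been colonized before?", "answers": "['Emukwai']", "lang": "bem", "split": "dev", "translated_answer": "['yes']", "translation_type": "human_translation" }"""


def parse_string_list(value):
    """Turn a stringified list such as "['yes']" into a real list of str."""
    # literal_eval only evaluates literals, so it is safer than eval().
    parsed = ast.literal_eval(value)
    if not isinstance(parsed, list):
        raise ValueError(f"expected a list literal, got {type(parsed).__name__}")
    return [str(item) for item in parsed]


def decode_instance(raw):
    """Parse one JSON record and expand its two stringified list fields."""
    record = json.loads(raw)
    record["answers"] = parse_string_list(record["answers"])
    record["translated_answer"] = parse_string_list(record["translated_answer"])
    return record


instance = decode_instance(RAW_INSTANCE)
print(instance["answers"], instance["translated_answer"])
```

Using `ast.literal_eval` rather than `json.loads` for these two fields is deliberate: the inner lists use single quotes, which are not valid JSON.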
Every language has three splits.
The original splits are named train, dev, and test, corresponding to the training, validation, and test splits.
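Because the validation split is named `dev` rather than `validation`, code that expects the more common name may need a small mapping. The sketch below is one way to do that. The `load_afriqa_split` helper is hypothetical, and it assumes that the Hugging Face `datasets` library is installed and that each language is exposed as a configuration named by its code (for example `bem`, as in the `lang` field of the example instance). Neither assumption is confirmed by this card.

```python
# Map common split spellings onto the names this dataset uses.
SPLIT_ALIASES = {
    "train": "train",
    "training": "train",
    "dev": "dev",
    "validation": "dev",
    "valid": "dev",
    "val": "dev",
    "test": "test",
}


def resolve_split(name):
    """Return the dataset's own split name for a common alias."""
    key = name.strip().lower()
    if key not in SPLIT_ALIASES:
        raise ValueError(f"unknown split {name!r}; expected one of {sorted(SPLIT_ALIASES)}")
    return SPLIT_ALIASES[key]


def load_afriqa_split(lang, split):
    """Hypothetical loader: assumes `lang` is a valid configuration name."""
    # Imported lazily so resolve_split() works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("masakhane/afriqa", lang, split=resolve_split(split))


print(resolve_split("validation"))
```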
The split sizes are as follows:
Language | train | dev | test |
---|---|---|---|
Bemba | 502 | 503 | 314 |
Fon | 427 | 428 | 386 |
Hausa | 435 | 436 | 300 |
Igbo | 417 | 418 | 409 |
Kinyarwanda | 407 | 409 | 347 |
Swahili | 415 | 417 | 302 |
Twi | 451 | 452 | 490 |
Wolof | 503 | 504 | 334 |
Yoruba | 360 | 361 | 332 |
Zulu | 387 | 388 | 325 |
Total | 4333 | 4346 | 3560 |
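As a quick consistency check, the per-language rows can be summed and compared with the Total row. The sketch below copies the numbers from the table as listed here. The computed sums may differ from the listed totals, so it reports both rather than assuming either is authoritative.

```python
# Split sizes copied from the table above: (train, dev, test).
SPLITS = ("train", "dev", "test")
SPLIT_SIZES = {
    "Bemba": (502, 503, 314),
    "Fon": (427, 428, 386),
    "Hausa": (435, 436, 300),
    "Igbo": (417, 418, 409),
    "Kinyarwanda": (407, 409, 347),
    "Swahili": (415, 417, 302),
    "Twi": (451, 452, 490),
    "Wolof": (503, 504, 334),
    "Yoruba": (360, 361, 332),
    "Zulu": (387, 388, 325),
}
# The Total row exactly as it appears in the table.
LISTED_TOTALS = {"train": 4333, "dev": 4346, "test": 3560}


def column_sums(sizes):
    """Sum each split column over all languages."""
    return {
        split: sum(row[i] for row in sizes.values())
        for i, split in enumerate(SPLITS)
    }


computed = column_sums(SPLIT_SIZES)
for split in SPLITS:
    diff = LISTED_TOTALS[split] - computed[split]
    print(f"{split}: rows sum to {computed[split]}, "
          f"listed total is {LISTED_TOTALS[split]} (difference {diff})")
```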
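The dataset was created to introduce question answering resources for 10 languages that are under-served in natural language processing.
[More Information Needed]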
...
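Initial data collection and normalization...
Who are the source language producers?...
Details can be found here...
Who are the annotators? Annotators were recruited from Masakhane.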
...
[More Information Needed]
[More Information Needed]
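Users should note that the dataset contains only news text, which may limit how well systems developed on it transfer to other domains.
The licensing status of the data is CC 4.0 Non-Commercial.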
The dataset can be cited with the following BibTeX entry:
@misc{ogundepo2023afriqa, title={AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages}, author={Odunayo Ogundepo and Tajuddeen R. Gwadabe and Clara E. Rivera and Jonathan H. Clark and Sebastian Ruder and David Ifeoluwa Adelani and Bonaventure F. P. Dossou and Abdou Aziz DIOP and Claytone Sikasote and Gilles Hacheme and Happy Buzaaba and Ignatius Ezeani and Rooweither Mabuya and Salomey Osei and Chris Emezue and Albert Njoroge Kahira and Shamsuddeen H. Muhammad and Akintunde Oladipo and Abraham Toluwase Owodunni and Atnafu Lambebo Tonja and Iyanuoluwa Shode and Akari Asai and Tunde Oluwaseyi Ajayi and Clemencia Siro and Steven Arthur and Mofetoluwa Adeyemi and Orevaoghene Ahia and Aremu Anuoluwapo and Oyinkansola Awosan and Chiamaka Chukwuneke and Bernard Opoku and Awokoya Ayodele and Verrah Otiende and Christine Mwase and Boyd Sinkala and Andre Niyongabo Rubungo and Daniel A. Ajisafe and Emeka Felix Onwuegbuzia and Habib Mbow and Emile Niyomutabazi and Eunice Mukonde and Falalu Ibrahim Lawan and Ibrahim Said Ahmad and Jesujoba O. Alabi and Martin Namukombo and Mbonu Chinedu and Mofya Phiri and Neo Putini and Ndumiso Mngoma and Priscilla A. Amuok and Ruqayya Nasir Iro and Sonia Adhiambo}, year={2023}, eprint={2305.06897}, archivePrefix={arXiv}, primaryClass={cs.CL} }
Thanks to @ToluClassics for adding this dataset.