数据集:
khalidalt/tydiqa-goldp
任务:
问答子任务:
extractive-qa计算机处理:
multilingual语言创建人:
crowdsourced批注创建人:
crowdsourced源数据集:
extended|wikipedia许可:
apache-2.0TyDi QA 是一个包含204K个问题-答案对的问答数据集,涵盖了11种语言的各种类型。TyDi QA的语言多样性很高,涵盖了各种语言表达的语言特性集合,我们期望在这个数据集上表现良好的模型能够对世界上众多语言进行泛化。该数据集中包含了英语专用语料库中找不到的语言现象。为了提供一个真实的信息获取任务并避免启动效应,问题是由想要了解答案但还不知道答案的人编写的(与SQuAD及其后代不同),数据是直接在每种语言中收集而来的(与MLQA和XQuAD不同,它们使用了翻译)。
“验证”示例如下。
This example was too long and was cropped: { "annotations": { "minimal_answers_end_byte": [-1, -1, -1], "minimal_answers_start_byte": [-1, -1, -1], "passage_answer_candidate_index": [-1, -1, -1], "yes_no_answer": ["NONE", "NONE", "NONE"] }, "document_plaintext": "\"\\nรองศาสตราจารย์[1] หม่อมราชวงศ์สุขุมพันธุ์ บริพัตร (22 กันยายน 2495 -) ผู้ว่าราชการกรุงเทพมหานครคนที่ 15 อดีตรองหัวหน้าพรรคปร...", "document_title": "หม่อมราชวงศ์สุขุมพันธุ์ บริพัตร", "document_url": "\"https://th.wikipedia.org/wiki/%E0%B8%AB%E0%B8%A1%E0%B9%88%E0%B8%AD%E0%B8%A1%E0%B8%A3%E0%B8%B2%E0%B8%8A%E0%B8%A7%E0%B8%87%E0%B8%...", "language": "thai", "passage_answer_candidates": "{\"plaintext_end_byte\": [494, 1779, 2931, 3904, 4506, 5588, 6383, 7122, 8224, 9375, 10473, 12563, 15134, 17765, 19863, 21902, 229...", "question_text": "\"หม่อมราชวงศ์สุขุมพันธุ์ บริพัตร เรียนจบจากที่ไหน ?\"..." }secondary_task
“验证”示例如下。
This example was too long and was cropped: { "answers": { "answer_start": [394], "text": ["بطولتين"] }, "context": "\"أقيمت البطولة 21 مرة، شارك في النهائيات 78 دولة، وعدد الفرق التي فازت بالبطولة حتى الآن 8 فرق، ويعد المنتخب البرازيلي الأكثر تت...", "id": "arabic-2387335860751143628-1", "question": "\"كم عدد مرات فوز الأوروغواي ببطولة كاس العالم لكرو القدم؟\"...", "title": "قائمة نهائيات كأس العالم" }
所有拆分之间的数据字段相同。
primary_taskname | train | validation |
---|---|---|
primary_task | 166916 | 18670 |
secondary_task | 49881 | 5077 |
@article{tydiqa, title = {TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages}, author = {Jonathan H. Clark and Eunsol Choi and Michael Collins and Dan Garrette and Tom Kwiatkowski and Vitaly Nikolaev and Jennimaria Palomaki} year = {2020}, journal = {Transactions of the Association for Computational Linguistics} }
@inproceedings{ruder-etal-2021-xtreme, title = "{XTREME}-{R}: Towards More Challenging and Nuanced Multilingual Evaluation", author = "Ruder, Sebastian and Constant, Noah and Botha, Jan and Siddhant, Aditya and Firat, Orhan and Fu, Jinlan and Liu, Pengfei and Hu, Junjie and Garrette, Dan and Neubig, Graham and Johnson, Melvin", booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing", month = nov, year = "2021", address = "Online and Punta Cana, Dominican Republic", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2021.emnlp-main.802", doi = "10.18653/v1/2021.emnlp-main.802", pages = "10215--10245", } }