Dataset:

AmazonScience/mintaka

Tasks:

Question Answering

Sub-tasks:

open-domain-qa

Languages:

ar de ja

Size:

100K<n<1M

Language creators:

found

Annotations creators:

expert-generated

Source datasets:

original

License:

cc-by-4.0

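Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering

Dataset Summary

Mintaka is a complex, natural, and multilingual question answering (QA) dataset of 20,000 question-answer pairs elicited from MTurk workers and annotated with Wikidata question and answer entities. Full details on the Mintaka dataset can be found in our paper: https://aclanthology.org/2022.coling-1.138/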

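To build Mintaka, we explicitly collected questions in 8 complexity types, as well as generic questions:

Count (e.g., Q: How many astronauts have been elected to Congress? A: 4)
Comparative (e.g., Q: Is Mont Blanc taller than Mount Rainier? A: Yes)
Superlative (e.g., Q: Who was the youngest tribute in the Hunger Games? A: Rue)
Ordinal (e.g., Q: Who was the last Ptolemaic ruler of Egypt? A: Cleopatra)
Multi-hop (e.g., Q: Who was the quarterback of the team that won Super Bowl 50? A: Peyton Manning)
Intersection (e.g., Q: Which movie was directed by Denis Villeneuve and stars Timothée Chalamet? A: Dune)
Difference (e.g., Q: Which Mario Kart game did Yoshi not appear in? A: Mario Kart Live: Home Circuit)
Yes/No (e.g., Q: Has Lady Gaga ever made a song with Ariana Grande? A: Yes)
Generic (e.g., Q: Where was Michael Phelps born? A: Baltimore, Maryland)

We collected questions about 8 categories: movies, music, sports, books, geography, politics, video games, and history.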

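Mintaka is one of the first large-scale complex, natural, and multilingual datasets that can be used for end-to-end question-answering models.

Supported Tasks and Leaderboards

The dataset can be used to train a model for question answering. To ensure comparability, please refer to our evaluation script: https://github.com/amazon-science/mintaka#evaluation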
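
As a rough illustration of what a comparable evaluation involves, the sketch below computes exact match between predicted and gold answer texts after light normalization. It is not the official script linked above: the `normalize` rules and the `exact_match` helper are assumptions made for this example, so numbers meant for comparison should come from the repository's own script.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse whitespace (assumed rules)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(predictions: list[str], golds: list[str]) -> float:
    """Fraction of predictions equal to the gold answer after normalization."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, golds))
    return hits / len(golds)


# Hypothetical model outputs scored against two gold answers from this card.
score = exact_match(["mount lucania", "Dune (2021)"], ["Mount Lucania", "Dune"])
```

The first prediction matches once case is ignored; the second does not, because the extra year survives normalization. A stricter or looser normalization would change the score, which is why a shared script matters.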

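Languages

All questions were written in English and translated into 8 additional languages: Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish.

Dataset Structure

Data Instances

An example from the "train" split looks as follows.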

{
  "id": "a9011ddf",
  "lang": "en",
  "question": "What is the seventh tallest mountain in North America?",
  "answerText": "Mount Lucania",
  "category": "geography",
  "complexityType": "ordinal",
  "questionEntity":
  [
      {
          "name": "Q49",
          "entityType": "entity",
          "label": "North America",
          "mention": "North America",
          "span": [40, 53]
      },
      {
          "name": 7,
          "entityType": "ordinal",
          "mention": "seventh",
          "span": [12, 19]
      }
  ],
  "answerEntity":
  [
      {
          "name": "Q1153188",
          "label": "Mount Lucania"
      }
  ]
}
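
The `span` values in this instance can be checked against the question text. The sketch below assumes each span is a `[start, end)` pair of character offsets with an exclusive end, as in Python slicing; that reading is inferred from this single example, and `mention_from_span` is a helper introduced here for illustration.

```python
question = "What is the seventh tallest mountain in North America?"

# Question entities copied from the instance above (the ordinal has no "label").
question_entities = [
    {"name": "Q49", "entityType": "entity", "label": "North America",
     "mention": "North America", "span": [40, 53]},
    {"name": 7, "entityType": "ordinal", "mention": "seventh", "span": [12, 19]},
]


def mention_from_span(text: str, span: list[int]) -> str:
    """Slice text with span read as [start, end), end exclusive (assumption)."""
    start, end = span
    return text[start:end]


recovered = [mention_from_span(question, e["span"]) for e in question_entities]

# True when every sliced substring equals the annotated mention.
all_match = all(m == e["mention"] for m, e in zip(recovered, question_entities))
```

For this instance both slices reproduce the annotated mentions, which supports the end-exclusive reading.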

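Data Fields

The data fields are the same among all splits.

id: A unique ID for the given sample.

lang: The language of the question.

question: The original question text in the given language.

answerText: The original answer text, elicited in English.

category: The category of the question. Options are: geography, movies, history, books, politics, music, videogames, or sports.

complexityType: The complexity type of the question. Options are: ordinal, intersection, count, superlative, yesno, comparative, multihop, difference, or generic.

questionEntity: A list of annotated question entities identified by crowd workers.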

{
     "name": The Wikidata Q-code or numerical value of the entity
     "entityType": The type of the entity. Options are:
             entity, cardinal, ordinal, date, time, percent, quantity, or money
     "label": The label of the Wikidata Q-code
     "mention": The entity as it appears in the English question text. Will be empty for non-English samples.
     "span": The start and end characters of the mention in the English question text. Will be empty for non-English samples.
}

answerEntity: A list of annotated answer entities identified by crowd workers.

{
     "name": The Wikidata Q-code or numerical value of the entity
     "label": The label of the Wikidata Q-code
}
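
Because `name` holds either a Wikidata Q-code or a numerical value, code that consumes these lists usually needs to tell the two apart. The sketch below is one way to do that. It assumes numerical values arrive as numbers, as in the example instance (`"name": 7`); if a given release stores them as strings, the numeric check would need adjusting. The helper names are hypothetical.

```python
import re
from typing import Any

# Wikidata item IDs: "Q" followed by a positive integer.
Q_CODE = re.compile(r"Q[1-9]\d*")


def wikidata_ids(entities: list[dict[str, Any]]) -> list[str]:
    """Return entity names that look like Wikidata Q-codes."""
    names = [e.get("name") for e in entities]
    return [n for n in names if isinstance(n, str) and Q_CODE.fullmatch(n)]


def numeric_values(entities: list[dict[str, Any]]) -> list[float]:
    """Return entity names that are plain numbers (bools excluded)."""
    names = [e.get("name") for e in entities]
    return [n for n in names
            if isinstance(n, (int, float)) and not isinstance(n, bool)]


# Entities from the example instance, reduced to the fields used here.
question_entity = [
    {"name": "Q49", "entityType": "entity", "label": "North America"},
    {"name": 7, "entityType": "ordinal"},
]
answer_entity = [{"name": "Q1153188", "label": "Mount Lucania"}]

question_ids = wikidata_ids(question_entity)
question_numbers = numeric_values(question_entity)
answer_ids = wikidata_ids(answer_entity)
```

Using `e.get("name")` rather than `e["name"]` keeps the helpers tolerant of entries with missing keys, such as the ordinal entity above, which has no `label`.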

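Data Splits

For each language, we split the data into train (14,000 samples), validation (2,000 samples), and test (4,000 samples) sets.

Personal and Sensitive Information

The corpus is free of personal or sensitive information.

Considerations for Using the Data

Additional Information

Dataset Curators

Amazon Alexa AI.

Licensing Information

This project is licensed under the CC-BY-4.0 License.

Citation Information

Please cite the following paper when using this dataset.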
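
The split sizes can be sanity-checked with a little arithmetic. The sketch below assumes every one of the nine languages (English plus the eight translations) carries the full set of questions, as the per-language split implies; the two-letter codes are ISO 639-1 codes chosen for this example rather than names confirmed by this card.

```python
# Per-language split sizes as stated above.
SPLIT_SIZES = {"train": 14_000, "validation": 2_000, "test": 4_000}

# English plus the eight translation languages (ISO 639-1 codes, assumed).
LANGUAGES = ["en", "ar", "fr", "de", "hi", "it", "ja", "pt", "es"]

per_language = sum(SPLIT_SIZES.values())   # questions per language
total = per_language * len(LANGUAGES)      # rows across all languages

# Share of each split within one language.
fractions = {name: size / per_language for name, size in SPLIT_SIZES.items()}

# The card's size bucket is 100K<n<1M.
in_size_bucket = 100_000 < total < 1_000_000
```

Under these assumptions each language holds the 20,000 pairs mentioned in the summary, split 70/10/20, and the all-language total falls inside the 100K<n<1M size bucket listed in the header.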

@inproceedings{sen-etal-2022-mintaka,
    title = "Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering",
    author = "Sen, Priyanka  and
      Aji, Alham Fikri  and
      Saffari, Amir",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.138",
    pages = "1604--1619"
}

Contributions

Thanks to @afaji for adding this dataset.

Author:

AmazonScience

Dataset size:

16.82 KB