Datasets:

xquad_r

Tasks:

question-answering

Sub-tasks:

extractive-qa

Multilinguality:

multilingual

Size Categories:

1K<n<10K

Language Creators:

found

Annotations Creators:

expert-generated

ArXiv:

arxiv:2004.05484

Dataset Card for XQuAD-R

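Dataset Summary

XQuAD-R is a retrieval version of the XQuAD dataset (a cross-lingual extractive QA dataset). Like XQuAD, XQuAD-R is an 11-way parallel dataset: each question appears in 11 different languages and has 11 parallel correct answers, one per language.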

Supported Tasks and Leaderboards

[More Information Needed]

Languages

The dataset is available in the following languages:

  • Arabic: xquad-r/ar.json
  • German: xquad-r/de.json
  • Greek: xquad-r/el.json
  • English: xquad-r/en.json
  • Spanish: xquad-r/es.json
  • Hindi: xquad-r/hi.json
  • Russian: xquad-r/ru.json
  • Thai: xquad-r/th.json
  • Turkish: xquad-r/tr.json
  • Vietnamese: xquad-r/vi.json
  • Chinese: xquad-r/zh.json
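
As a minimal sketch, the file layout listed above can be captured in a small lookup. The names `LANGUAGES` and `data_file` are hypothetical and introduced only for illustration; the language codes and the `xquad-r/<code>.json` paths are the only parts taken from the list.

```python
# Language codes and names, copied from the list above.
LANGUAGES = {
    "ar": "Arabic",
    "de": "German",
    "el": "Greek",
    "en": "English",
    "es": "Spanish",
    "hi": "Hindi",
    "ru": "Russian",
    "th": "Thai",
    "tr": "Turkish",
    "vi": "Vietnamese",
    "zh": "Chinese",
}


def data_file(lang: str) -> str:
    """Return the relative data file path for a language code (hypothetical helper)."""
    if lang not in LANGUAGES:
        raise ValueError(f"unsupported language code: {lang!r}")
    return f"xquad-r/{lang}.json"


# Print one line per language, mirroring the list above.
for code, name in LANGUAGES.items():
    print(f"{name}: {data_file(code)}")
```

Rejecting unknown codes early keeps a typo such as `"cn"` from silently producing a path that does not exist.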

Dataset Structure

[More Information Needed]

Data Instances

An example from the en config:

{'id': '56beb4343aeaaa14008c925b',
 'context': "The Panthers defense gave up just 308 points, ranking sixth in the league, while also leading the NFL in interceptions with 24 and boasting four Pro Bowl selections. Pro Bowl defensive tackle Kawann Short led the team in sacks with 11, while also forcing three fumbles and recovering two. Fellow lineman Mario Addison added 6½ sacks. The Panthers line also featured veteran defensive end Jared Allen, a 5-time pro bowler who was the NFL's active career sack leader with 136, along with defensive end Kony Ealy, who had 5 sacks in just 9 starts. Behind them, two of the Panthers three starting linebackers were also selected to play in the Pro Bowl: Thomas Davis and Luke Kuechly. Davis compiled 5½ sacks, four forced fumbles, and four interceptions, while Kuechly led the team in tackles (118) forced two fumbles, and intercepted four passes of his own. Carolina's secondary featured Pro Bowl safety Kurt Coleman, who led the team with a career high seven interceptions, while also racking up 88 tackles and Pro Bowl cornerback Josh Norman, who developed into a shutdown corner during the season and had four interceptions, two of which were returned for touchdowns.",
 'question': 'How many points did the Panthers defense surrender?',
 'answers': {'text': ['308'], 'answer_start': [34]}}

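Data Fields

  • id (str): Unique identifier of the context-question pair.
  • context (str): The context for the question.
  • question (str): The question.
  • answers (dict): The answers, with the following keys:
    • text (list of str): The answer texts.
    • answer_start (list of int): The start position of each answer text, given as a character offset into context.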
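
To illustrate how `answer_start` relates to `text`, the sketch below checks that each answer can be recovered by slicing `context` at the given offset. It assumes `answer_start` is a 0-based character offset, which holds for the `en` example above. The helper `answer_spans_match` is a hypothetical name, and the context is truncated to its opening clause for brevity.

```python
def answer_spans_match(example: dict) -> bool:
    """Check that every answer text occurs in the context at its stated offset."""
    context = example["context"]
    texts = example["answers"]["text"]
    starts = example["answers"]["answer_start"]
    # The two lists are parallel: one start offset per answer text.
    if len(texts) != len(starts):
        return False
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(texts, starts)
    )


# Fields copied from the en example; the context is shortened here.
example = {
    "id": "56beb4343aeaaa14008c925b",
    "context": "The Panthers defense gave up just 308 points, ranking sixth in the league",
    "question": "How many points did the Panthers defense surrender?",
    "answers": {"text": ["308"], "answer_start": [34]},
}

print(answer_spans_match(example))
```

A check like this is a cheap way to catch offset drift after any preprocessing that changes the context string, such as whitespace normalization.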

Data Splits

The number of questions and candidate sentences for each language in XQuAD-R is shown in the table below:

XQuAD-R
language  questions  candidates
ar        1190       1222
de        1190       1276
el        1190       1234
en        1190       1180
es        1190       1215
hi        1190       1244
ru        1190       1219
th        1190       852
tr        1190       1167
vi        1190       1209
zh        1190       1196
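
The totals implied by the table can be computed directly. This is a small sketch that uses only the counts listed above; the name `SPLITS` is introduced for illustration.

```python
# (questions, candidates) per language, copied from the table above.
SPLITS = {
    "ar": (1190, 1222),
    "de": (1190, 1276),
    "el": (1190, 1234),
    "en": (1190, 1180),
    "es": (1190, 1215),
    "hi": (1190, 1244),
    "ru": (1190, 1219),
    "th": (1190, 852),
    "tr": (1190, 1167),
    "vi": (1190, 1209),
    "zh": (1190, 1196),
}

# Sum each column across all 11 languages.
total_questions = sum(questions for questions, _ in SPLITS.values())
total_candidates = sum(candidates for _, candidates in SPLITS.values())

print(f"questions: {total_questions}, candidates: {total_candidates}")
```

Summed over all languages, this gives 13,090 questions and 13,014 candidate sentences.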

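Dataset Creation

[More Information Needed]

Curation Rationale

[More Information Needed]

Source Data

[More Information Needed]

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

[More Information Needed]

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

[More Information Needed]

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

[More Information Needed]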

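Dataset Curators

The dataset was initially created by Uma Roy, Noah Constant, Rami Al-Rfou, Aditya Barua, Aaron Phillips and Yinfei Yang during work done at Google Research.

Licensing Information

XQuAD-R is distributed under the CC BY-SA 4.0 license.

Citation Information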

@article{roy2020lareqa,
  title={LAReQA: Language-agnostic answer retrieval from a multilingual pool},
  author={Roy, Uma and Constant, Noah and Al-Rfou, Rami and Barua, Aditya and Phillips, Aaron and Yang, Yinfei},
  journal={arXiv preprint arXiv:2004.05484},
  year={2020}
}

Contributions

Thanks to @manandey for adding this dataset.