Datasets:

duorc

Languages:

en

Multilinguality:

monolingual

Language Creators:

crowdsourced

Annotations Creators:

crowdsourced

Source Datasets:

original

ArXiv:

arxiv:1804.07927

License:

mit

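Dataset Card for duorc

Dataset Summary

DuoRC is an English question answering dataset. Its questions and answers were collected from crowdsourced Amazon Mechanical Turk (AMT) workers working with Wikipedia and IMDb movie plots. The workers were free either to pick answers from the plots or to synthesize their own answers. The dataset contains two sub-datasets, SelfRC and ParaphraseRC. SelfRC is built solely on Wikipedia movie plots. In ParaphraseRC, the questions were written from the Wikipedia movie plots, and the answers are given based on the corresponding IMDb movie plots.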
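The two sub-datasets can be loaded separately. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the dataset is reachable on the Hub under the id `duorc` with configurations named `SelfRC` and `ParaphraseRC`; the Hub id and the helper name `load_duorc` are assumptions for illustration and may need adjusting.

```python
# Assumed configuration names, matching the two sub-datasets described above.
CONFIGS = ("SelfRC", "ParaphraseRC")


def load_duorc(config: str, split: str = "train"):
    """Load one DuoRC sub-dataset (hypothetical helper, not part of any library)."""
    if config not in CONFIGS:
        raise ValueError(f"config must be one of {CONFIGS}, got {config!r}")
    # Imported lazily so that the validation above works without the library.
    from datasets import load_dataset

    # Assumption: the dataset id on the Hub is "duorc".
    return load_dataset("duorc", config, split=split)
```

Calling `load_duorc("SelfRC")` would download the data on first use, so it needs network access.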

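Supported Tasks and Leaderboards

  • Abstractive question answering: the dataset can be used to train a model for abstractive question answering. Such a model is presented with a passage and a question and is expected to generate a multi-word answer. Model performance is measured by exact match and F1 score, as in SQuAD V1.1 or SQuAD V2. A BART-based model with a dense retriever may be used for this task.

  • Extractive question answering: the dataset can be used to train a model for extractive question answering. Such a model is presented with a passage and a question and is expected to predict the start and end of the answer span within the passage. Model performance is measured by exact match and F1 score, as in SQuAD V1.1 or SQuAD V2. BertForQuestionAnswering or any similar model may be used for this task.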
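Both tasks are scored with exact match and token-level F1. The sketch below follows the general SQuAD-style recipe (lowercasing, stripping punctuation and English articles, then comparing tokens); it is an illustrative reimplementation, not the official evaluation script, and the function names are chosen here for the example.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings count as a match; one empty string is a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric, prediction, references):
    """Score against every reference answer and keep the best score.

    Assumption: a question with no reference answers is scored against
    the empty string, so an empty prediction counts as correct.
    """
    return max(metric(prediction, ref) for ref in (references or [""]))
```

Taking the maximum over references matters here because each question carries a list of answers rather than a single string.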

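Languages

The text in the dataset is in English, as written by the Wikipedia authors of the movie plots. The associated BCP-47 code is en.

Dataset Structure

Data Instances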

{'answers': ['They arrived by train.'], 'no_answer': False, 'plot': "200 years in the future, Mars has been colonized by a high-tech company.\nMelanie Ballard (Natasha Henstridge) arrives by train to a Mars mining camp which has cut all communication links with the company headquarters. She's not alone, as she is with a group of fellow police officers. They find the mining camp deserted except for a person in the prison, Desolation Williams (Ice Cube), who seems to laugh about them because they are all going to die. They were supposed to take Desolation to headquarters, but decide to explore first to find out what happened.They find a man inside an encapsulated mining car, who tells them not to open it. However, they do and he tries to kill them. One of the cops witnesses strange men with deep scarred and heavily tattooed faces killing the remaining survivors. The cops realise they need to leave the place fast.Desolation explains that the miners opened a kind of Martian construction in the soil which unleashed red dust. Those who breathed that dust became violent psychopaths who started to build weapons and kill the uninfected. They changed genetically, becoming distorted but much stronger.The cops and Desolation leave the prison with difficulty, and devise a plan to kill all the genetically modified ex-miners on the way out. However, the plan goes awry, and only Melanie and Desolation reach headquarters alive. Melanie realises that her bosses won't ever believe her. However, the red dust eventually arrives to headquarters, and Melanie and Desolation need to fight once again.", 'plot_id': '/m/03vyhn', 'question': 'How did the police arrive at the Mars mining camp?', 'question_id': 'b440de7d-9c3f-841c-eaec-a14bdff950d1', 'title': 'Ghosts of Mars'}

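Data Fields

  • plot_id: a string feature containing the movie plot ID.
  • plot: a string feature containing the movie plot text.
  • title: a string feature containing the movie title.
  • question_id: a string feature containing the question ID.
  • question: a string feature containing the question text.
  • answers: a list of string features containing the answers.
  • no_answer: a boolean feature indicating whether the question has no answer.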
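The field list above can be turned into a lightweight schema check. This is a minimal sketch in plain Python; `DUORC_FIELDS` and `validate_record` are hypothetical names introduced for illustration and are not part of any library.

```python
from typing import Any, Dict, List

# Expected Python type for each field listed above.
DUORC_FIELDS = {
    "plot_id": str,
    "plot": str,
    "title": str,
    "question_id": str,
    "question": str,
    "answers": list,
    "no_answer": bool,
}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for name, expected in DUORC_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    answers = record.get("answers")
    if isinstance(answers, list) and not all(isinstance(a, str) for a in answers):
        problems.append("answers: every item should be a string")
    return problems
```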

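Data Splits

The data is split into training, dev, and test sets so that the resulting sets contain 70%, 15%, and 15% of the total QA pairs, and so that the test set contains no QA pairs for any movie seen in the training set. The final split sizes are as follows: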

Name          Train   Dev     Test
SelfRC        60721   12961   12599
ParaphraseRC  69524   15591   15857
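A split with these properties has to be made at the movie level rather than the question level. The sketch below shows one way to do that: it groups QA pairs by `plot_id`, shuffles the movies, and fills the training, dev, and test sets greedily until each reaches its share of QA pairs. This is an assumption made for illustration and not necessarily the procedure the dataset authors used; `split_by_movie` is a hypothetical helper.

```python
import random
from collections import defaultdict


def split_by_movie(records, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split a list of QA records so that no movie appears in two splits."""
    by_movie = defaultdict(list)
    for rec in records:
        by_movie[rec["plot_id"]].append(rec)

    # Sort first so the shuffle is reproducible for a given seed.
    movie_ids = sorted(by_movie)
    random.Random(seed).shuffle(movie_ids)

    total = len(records)
    train_cap = ratios[0] * total
    dev_cap = (ratios[0] + ratios[1]) * total

    splits = {"train": [], "dev": [], "test": []}
    seen = 0  # QA pairs assigned so far
    for movie_id in movie_ids:
        if seen < train_cap:
            name = "train"
        elif seen < dev_cap:
            name = "dev"
        else:
            name = "test"
        # All QA pairs of one movie always land in the same split.
        splits[name].extend(by_movie[movie_id])
        seen += len(by_movie[movie_id])
    return splits
```

Because whole movies are assigned at once, the resulting proportions only approximate 70/15/15, which is consistent with the slightly uneven sizes in the table.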

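Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Wikipedia and IMDb movie plots

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

For SelfRC, the annotators were allowed either to mark an answer span in the plot or to synthesize their own answer after reading the Wikipedia movie plot. For ParaphraseRC, the questions written from the Wikipedia movie plots in SelfRC were reused, and the annotators were asked to answer them based on the IMDb movie plots.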
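Because annotators could synthesize answers, not every answer is a literal span of the plot, which matters when preparing data for an extractive model. The following is a minimal sketch that checks whether an answer occurs verbatim (ignoring case) in a plot and returns its character offsets; `find_answer_span` is a hypothetical helper, and exact string matching is a simplifying assumption.

```python
import re
from typing import Optional, Tuple


def find_answer_span(plot: str, answer: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the first match, or None.

    The match is case-insensitive and literal: the answer is escaped so
    that punctuation in it is not interpreted as a regular expression.
    """
    needle = answer.strip()
    if not needle:
        return None
    match = re.search(re.escape(needle), plot, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.span()
```

A result of `None` suggests the answer was synthesized or paraphrased, so the example may be better suited to the abstractive setting.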

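Who are the annotators?

Amazon Mechanical Turk workers

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

The dataset was initially created by Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan in a collaboration between IIT Madras and IBM Research.

Licensing Information

MIT License

Citation Information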

@inproceedings{DuoRC,
author = { Amrita Saha and Rahul Aralikatte and Mitesh M. Khapra and Karthik Sankaranarayanan},
title = {{DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension}},
booktitle = {Meeting of the Association for Computational Linguistics (ACL)},
year = {2018}
}

Contributions

Thanks to @gchhablani for adding this dataset.