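Datasets: duorc
Languages: en
Multilinguality: monolingual
Language Creators: crowdsourced
Annotations Creators: crowdsourced
Source Datasets: original
ArXiv: arxiv:1804.07927
License: mit

The DuoRC dataset is an English-language dataset of questions and answers collected from crowdsourced AMT workers on Wikipedia and IMDb movie plots. The workers were free to pick answers from the plot or to synthesize their own. It contains two sub-datasets, SelfRC and ParaphraseRC. SelfRC is built solely on Wikipedia movie plots. In ParaphraseRC, the questions were written from the Wikipedia movie plots and the answers were given based on the corresponding IMDb movie plots.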
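Abstractive question answering: The dataset can be used to train a model for abstractive question answering. The model is given a passage and a question and is expected to generate a multi-word answer. Performance is measured by exact match and F1 score, as in SQuAD V1.1 or SQuAD V2. A BART-based model with a dense retriever may be used for this task.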
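The exact match and F1 scores mentioned above can be sketched in a few lines. This is a minimal illustration, assuming the SQuAD-style normalization (lowercasing, stripping punctuation and English articles) that is commonly applied before comparing answers. The function names are illustrative and not taken from the dataset's tooling.

```python
import re
import string
from collections import Counter
from typing import Callable, Sequence

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Assumption: two empty answers count as a match, anything else as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(
    metric: Callable[[str, str], float],
    prediction: str,
    references: Sequence[str],
) -> float:
    """Score against every reference answer and keep the best (needs >= 1 reference)."""
    return max(metric(prediction, ref) for ref in references)
```

Because a DuoRC record stores its gold answers as a list, `best_over_references(f1_score, prediction, record["answers"])` is one plausible way to score a single prediction.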
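Extractive question answering: The dataset can be used to train a model for extractive question answering. The model is given a passage and a question and is expected to predict the start and end positions of the answer in the passage. Performance is measured by exact match and F1 score, as in SQuAD V1.1 or SQuAD V2. BertForQuestionAnswering or any similar model may be used for this task.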
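Extractive models need start and end positions as training targets, while DuoRC stores answers as plain strings. The sketch below shows one way to recover character offsets, assuming a case-insensitive search for the first occurrence of the answer. The helper `find_answer_span` is hypothetical. It returns `None` when the answer was synthesized by the worker and does not appear verbatim in the plot.

```python
import re
from typing import Optional, Tuple


def find_answer_span(plot: str, answer: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the answer in the plot.

    The end offset is exclusive. Returns None if the answer text does not
    occur in the plot, which happens for synthesized answers.
    """
    # Assumption: a trailing period on the answer is not part of the span.
    needle = answer.strip().rstrip(".")
    if not needle:
        return None
    match = re.search(re.escape(needle), plot, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.span()
```

Only the first occurrence is returned. A real preprocessing pipeline would also have to decide how to handle repeated mentions and how to map character offsets to token positions.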
The text in the dataset is in English, as written by the authors of Wikipedia movie plots. The associated BCP-47 code is en.
{'answers': ['They arrived by train.'], 'no_answer': False, 'plot': "200 years in the future, Mars has been colonized by a high-tech company.\nMelanie Ballard (Natasha Henstridge) arrives by train to a Mars mining camp which has cut all communication links with the company headquarters. She's not alone, as she is with a group of fellow police officers. They find the mining camp deserted except for a person in the prison, Desolation Williams (Ice Cube), who seems to laugh about them because they are all going to die. They were supposed to take Desolation to headquarters, but decide to explore first to find out what happened.They find a man inside an encapsulated mining car, who tells them not to open it. However, they do and he tries to kill them. One of the cops witnesses strange men with deep scarred and heavily tattooed faces killing the remaining survivors. The cops realise they need to leave the place fast.Desolation explains that the miners opened a kind of Martian construction in the soil which unleashed red dust. Those who breathed that dust became violent psychopaths who started to build weapons and kill the uninfected. They changed genetically, becoming distorted but much stronger.The cops and Desolation leave the prison with difficulty, and devise a plan to kill all the genetically modified ex-miners on the way out. However, the plan goes awry, and only Melanie and Desolation reach headquarters alive. Melanie realises that her bosses won't ever believe her. However, the red dust eventually arrives to headquarters, and Melanie and Desolation need to fight once again.", 'plot_id': '/m/03vyhn', 'question': 'How did the police arrive at the Mars mining camp?', 'question_id': 'b440de7d-9c3f-841c-eaec-a14bdff950d1', 'title': 'Ghosts of Mars'}
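The fields of the record above can be written down as a typed structure. This is a sketch only: the field types are inferred from that single example, and the class name `DuoRCRecord` and the helper `reference_answers` are illustrative. Returning an empty-string reference for unanswerable questions is an assumption borrowed from SQuAD V2-style evaluation, not something the source specifies.

```python
from typing import List, TypedDict


class DuoRCRecord(TypedDict):
    """One question-answer pair; types inferred from the example record."""

    answers: List[str]   # one or more gold answers
    no_answer: bool      # True when the question has no answer
    plot: str            # full movie plot used as the passage
    plot_id: str         # movie identifier, e.g. '/m/03vyhn'
    question: str
    question_id: str
    title: str           # movie title


def reference_answers(record: DuoRCRecord) -> List[str]:
    """Return the gold answers to score a prediction against."""
    if record["no_answer"]:
        # Assumption: an unanswerable question is scored against "".
        return [""]
    return record["answers"]
```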
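The data is split into training, dev, and test sets such that the resulting sets contain 70%, 15%, and 15% of the total QA pairs, and no QA pair for any movie seen in the training set is included in the test set. The final split sizes are as follows: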
Name          Train   Dev     Test
SelfRC        60721   12961   12599
ParaphraseRC  69524   15591   15857
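The split sizes work out to roughly 70.4% / 15.0% / 14.6% for SelfRC and 68.9% / 15.4% / 15.7% for ParaphraseRC, close to the stated 70/15/15 target. The source does not describe the exact splitting procedure. The sketch below is one assumed way to get a movie-disjoint split: it groups QA pairs by `plot_id` and assigns whole movies greedily to whichever split is furthest below its target size. It keeps all three splits disjoint, which is stricter than the train/test guarantee stated above. Function names, the seed, and the greedy rule are all illustrative.

```python
import random
from collections import defaultdict
from typing import Any, Dict, List, Mapping, Sequence, Tuple

SPLIT_NAMES = ("train", "dev", "test")


def split_fractions(train: int, dev: int, test: int) -> Tuple[float, float, float]:
    """Share of QA pairs that falls into each split."""
    total = train + dev + test
    return train / total, dev / total, test / total


def movie_disjoint_split(
    records: Sequence[Mapping[str, Any]],
    fractions: Tuple[float, float, float] = (0.70, 0.15, 0.15),
    seed: int = 0,
) -> Dict[str, List[Mapping[str, Any]]]:
    """Assign whole movies to splits so that no movie spans two splits."""
    by_movie: Dict[str, List[Mapping[str, Any]]] = defaultdict(list)
    for record in records:
        by_movie[record["plot_id"]].append(record)

    # Shuffle movies deterministically before assigning them.
    movie_ids = sorted(by_movie)
    random.Random(seed).shuffle(movie_ids)

    targets = {
        name: fraction * len(records)
        for name, fraction in zip(SPLIT_NAMES, fractions)
    }
    splits: Dict[str, List[Mapping[str, Any]]] = {name: [] for name in SPLIT_NAMES}
    for movie_id in movie_ids:
        # Greedy rule: give the movie to the split with the largest shortfall.
        name = max(SPLIT_NAMES, key=lambda n: targets[n] - len(splits[n]))
        splits[name].extend(by_movie[movie_id])
    return splits
```

Because movies are assigned whole, the achieved proportions only approximate the targets, which is consistent with the published sizes not being exactly 70/15/15.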
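[More Information Needed]
Wikipedia and IMDb movie plots
Initial data collection and normalization: [More Information Needed]
Who are the source language producers? [More Information Needed]
For SelfRC, the annotators were allowed to mark an answer span in the plot or to synthesize their own answers after reading the Wikipedia movie plot. For ParaphraseRC, the questions from SelfRC, written from the Wikipedia movie plots, were reused, and the annotators were asked to answer them based on the IMDb movie plots.
Who are the annotators? Amazon Mechanical Turk workers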
[More Information Needed]
The dataset was initially created by Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan in a collaboration between IIT Madras and IBM Research.
@inproceedings{DuoRC,
  author = {Amrita Saha and Rahul Aralikatte and Mitesh M. Khapra and Karthik Sankaranarayanan},
  title = {{DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension}},
  booktitle = {Meeting of the Association for Computational Linguistics (ACL)},
  year = {2018}
}
Thanks to @gchhablani for adding this dataset.