Datasets:

squad

Tasks:

Question Answering

Sub-tasks:

extractive-qa

Languages:

Multilinguality:

monolingual

Size:

10K<n<100K

Language creators:

crowdsourced, found

Annotation creators:

crowdsourced

Source datasets:

extended|wikipedia

Preprint archive:

arxiv:1606.05250

License:

cc-by-4.0

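Dataset Card for "squad"

Dataset Summary

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.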
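
A minimal sketch of loading one split with the Hugging Face `datasets` library. It assumes that `datasets` is installed and that the dataset is reachable under the identifier `squad` given in the metadata above. The helper name `load_squad_split` and its `loader` parameter are hypothetical, introduced here only so the function can be exercised without network access.

```python
from typing import Any, Callable, Optional


def load_squad_split(split: str = "train",
                     loader: Optional[Callable[..., Any]] = None) -> Any:
    """Load one split of the dataset identified as "squad".

    `loader` is a hypothetical injection point for testing. When it is
    omitted, the function falls back to `datasets.load_dataset`.
    """
    if loader is None:
        # Imported lazily so this module can be imported without `datasets`.
        from datasets import load_dataset
        loader = load_dataset
    return loader("squad", split=split)
```

Calling `load_squad_split("validation")` with the default loader would download the data on first use, so the download size listed under Data Instances applies.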

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

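Dataset Structure

Data Instances

plain_text

Size of downloaded dataset files: 35.14 MB
Size of the generated dataset: 89.92 MB
Total amount of disk used: 125.06 MB

An example from the "train" split looks as follows.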

{
    "answers": {
        "answer_start": [1],
        "text": ["This is a test text"]
    },
    "context": "This is a test context.",
    "id": "1",
    "question": "Is this a test?",
    "title": "train test"
}
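
The record layout can be read as follows: each entry in `answer_start` locates the matching entry in `text` inside `context`. The sketch below assumes `answer_start` is a 0-based character offset, and checks whether slicing `context` at that offset reproduces the answer. The function name `extract_answer_spans` and the `toy` record are hypothetical, written for this illustration; the record shown in the card appears to use placeholder values, so it does not pass this check.

```python
from typing import Dict, List, Tuple


def extract_answer_spans(example: Dict) -> List[Tuple[int, int, bool]]:
    """Return (start, end, matches) for each answer in a record.

    Assumes `answer_start` is a 0-based character offset into `context`.
    `matches` is True when context[start:end] equals the answer text.
    """
    context = example["context"]
    answers = example["answers"]
    spans = []
    for start, text in zip(answers["answer_start"], answers["text"]):
        end = start + len(text)
        spans.append((start, end, context[start:end] == text))
    return spans


# Hypothetical record written for this sketch (not taken from the dataset).
toy = {
    "answers": {"answer_start": [10], "text": ["test"]},
    "context": "This is a test context.",
    "id": "toy-1",
    "question": "What kind of context is this?",
    "title": "toy",
}
print(extract_answer_spans(toy))  # prints [(10, 14, True)]
```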

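Data Fields

The data fields are the same across all splits.

plain_text

id: a string feature.
title: a string feature.
context: a string feature.
question: a string feature.
answers: a dictionary feature containing:
- text: a string feature.
- answer_start: an int32 feature.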
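
The field list above can be written down as a typed schema. This is a sketch under two assumptions: `text` and `answer_start` are parallel lists, as in the example record, and `answer_start` values must fit in a signed 32-bit integer. The names `Answers`, `SquadExample`, and `is_valid_example` are hypothetical and not part of any library.

```python
from typing import Any, List, TypedDict

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


class Answers(TypedDict):
    text: List[str]
    answer_start: List[int]


class SquadExample(TypedDict):
    id: str
    title: str
    context: str
    question: str
    answers: Answers


def is_valid_example(record: Any) -> bool:
    """Check that a record has the fields and types listed in the card."""
    if not isinstance(record, dict):
        return False
    # The four top-level string fields.
    for key in ("id", "title", "context", "question"):
        if not isinstance(record.get(key), str):
            return False
    answers = record.get("answers")
    if not isinstance(answers, dict):
        return False
    texts = answers.get("text")
    starts = answers.get("answer_start")
    if not isinstance(texts, list) or not isinstance(starts, list):
        return False
    # Each answer text needs exactly one start offset.
    if len(texts) != len(starts):
        return False
    if not all(isinstance(t, str) for t in texts):
        return False
    # bool is a subclass of int in Python, so exclude it explicitly.
    return all(
        isinstance(s, int) and not isinstance(s, bool)
        and INT32_MIN <= s <= INT32_MAX
        for s in starts
    )
```

A check like this only validates shape and types; it does not verify that an offset actually points at the answer text.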

Data Splits

name	train	validation
plain_text	87599	10570
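
As a quick arithmetic check on the table above, the sketch below totals the two splits, computes each split's share, and confirms that the total falls inside the 10K<n<100K size category from the metadata. The names `SPLIT_SIZES` and `split_summary` are hypothetical.

```python
from typing import Dict, Tuple

# Split sizes copied from the table in the card.
SPLIT_SIZES = {"train": 87599, "validation": 10570}


def split_summary(sizes: Dict[str, int]) -> Tuple[int, Dict[str, float]]:
    """Return the total number of examples and each split's share of it."""
    total = sum(sizes.values())
    shares = {name: count / total for name, count in sizes.items()}
    return total, shares


total, shares = split_summary(SPLIT_SIZES)
print(total)  # prints 98169
print(10_000 < total < 100_000)  # prints True
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```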

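Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed

Citation Information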

@article{2016arXiv160605250R,
       author = {{Rajpurkar}, Pranav and {Zhang}, Jian and {Lopyrev},
                 Konstantin and {Liang}, Percy},
        title = "{SQuAD: 100,000+ Questions for Machine Comprehension of Text}",
      journal = {arXiv e-prints},
         year = 2016,
          eid = {arXiv:1606.05250},
        pages = {arXiv:1606.05250},
archivePrefix = {arXiv},
       eprint = {1606.05250},
}

Contributions

Thanks to @lewtun, @albertvillanova, @patrickvonplaten, @thomwolf for adding this dataset.

Author:

Anonymous

Dataset size:

16.09 KB
