Dataset:

break_data

Task:

Text-to-text generation

Sub-task:

open-domain-abstractive-qa

Language:

Multilinguality:

monolingual

Size:

10K<n<100K

Language creators:

crowdsourced

Annotation creators:

crowdsourced

Source datasets:

original

License:

license:unknown

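Dataset Card for "break_data"

Dataset Summary

Break is a human-annotated dataset of natural language questions and their Question Decomposition Meaning Representations (QDMRs). It contains 83,978 examples drawn from 10 question-answering datasets over text, images, and databases. This repository contains the Break dataset along with information on the exact data format.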

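Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

QDMR

Size of downloaded dataset files: 15.97 MB
Size of the generated dataset: 15.93 MB
Total amount of disk used: 31.90 MB

An example of 'validation' looks as follows.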

{
    "decomposition": "return flights ;return #1 from  denver ;return #2 to philadelphia ;return #3 if  available",
    "operators": "['select', 'filter', 'filter', 'filter']",
    "question_id": "ATIS_dev_0",
    "question_text": "what flights are available tomorrow from denver to philadelphia ",
    "split": "dev"
}
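In the record above, `decomposition` packs the QDMR steps into one string separated by `;`, and `operators` holds a Python-style list literal stored as a string. The following is a minimal sketch of unpacking both, assuming these two conventions hold for every record and that each step has exactly one operator. The helper name `parse_qdmr` is hypothetical and not part of the dataset's tooling.

```python
import ast


def parse_qdmr(record):
    """Pair each decomposition step with its operator.

    Assumptions (taken from the example record, not from a spec):
    steps are separated by ';', and `operators` is the string form
    of a Python list with one entry per step.
    """
    # Split on ';' and collapse the irregular spacing inside each step.
    steps = [" ".join(part.split()) for part in record["decomposition"].split(";")]
    steps = [step for step in steps if step]
    # literal_eval parses the list literal without executing any code.
    operators = ast.literal_eval(record["operators"])
    if len(steps) != len(operators):
        raise ValueError("step/operator count mismatch")
    return list(zip(operators, steps))


example = {
    "decomposition": "return flights ;return #1 from  denver ;return #2 to philadelphia ;return #3 if  available",
    "operators": "['select', 'filter', 'filter', 'filter']",
}

for operator, step in parse_qdmr(example):
    print(f"{operator:>7}: {step}")
```

Using `ast.literal_eval` rather than `eval` keeps the parsing safe even if a field contains unexpected text.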

QDMR-high-level

Size of downloaded dataset files: 15.97 MB
Size of the generated dataset: 6.54 MB
Total amount of disk used: 22.51 MB

An example of 'train' looks as follows.

{
    "decomposition": "return ground transportation ;return #1 which  is  available ;return #2 from  the pittsburgh airport ;return #3 to downtown ;return the cost of #4",
    "operators": "['select', 'filter', 'filter', 'filter', 'project']",
    "question_id": "ATIS_dev_102",
    "question_text": "what ground transportation is available from the pittsburgh airport to downtown and how much does it cost ",
    "split": "dev"
}
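Steps in a decomposition refer back to earlier steps with `#N` markers, so each record implicitly describes a small dependency graph. As a rough illustration, the sketch below extracts those references from the example above. It assumes that references always take the form `#` followed by a 1-based step number; the function name `step_references` is hypothetical.

```python
import re

# Assumption: a reference is '#' followed by a 1-based step number.
REFERENCE = re.compile(r"#(\d+)")


def step_references(decomposition):
    """Map each 1-based step number to the earlier steps it mentions."""
    steps = [s.strip() for s in decomposition.split(";") if s.strip()]
    return {
        index: [int(number) for number in REFERENCE.findall(step)]
        for index, step in enumerate(steps, start=1)
    }


decomposition = (
    "return ground transportation ;return #1 which  is  available ;"
    "return #2 from  the pittsburgh airport ;return #3 to downtown ;"
    "return the cost of #4"
)

print(step_references(decomposition))
# {1: [], 2: [1], 3: [2], 4: [3], 5: [4]}
```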

QDMR-high-level-lexicon

Size of downloaded dataset files: 15.97 MB
Size of the generated dataset: 31.64 MB
Total amount of disk used: 47.61 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "allowed_tokens": "\"['higher than', 'same as', 'what ', 'and ', 'than ', 'at most', 'he', 'distinct', 'House', 'two', 'at least', 'or ', 'date', 'o...",
    "source": "What office, also held by a member of the Maine House of Representatives, did James K. Polk hold before he was president?"
}

QDMR-lexicon

Size of downloaded dataset files: 15.97 MB
Size of the generated dataset: 77.19 MB
Total amount of disk used: 93.16 MB

An example of 'validation' looks as follows.

This example was too long and was cropped:

{
    "allowed_tokens": "\"['higher than', 'same as', 'what ', 'and ', 'than ', 'at most', 'distinct', 'two', 'at least', 'or ', 'date', 'on ', '@@14@@', ...",
    "source": "what flights are available tomorrow from denver to philadelphia "
}
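The cropped `allowed_tokens` values above begin with an escaped double quote followed by `[`, which suggests that the field holds a Python list literal wrapped in one extra pair of quotes. Because the examples are truncated, this is only an inference. The sketch below decodes such a value under that assumption; `parse_allowed_tokens` and the shortened `sample` string are hypothetical.

```python
import ast


def parse_allowed_tokens(raw):
    """Decode an `allowed_tokens` string into a list of tokens.

    Assumption: the full value is a Python list literal wrapped in one
    extra pair of double quotes, as the cropped examples suggest.
    """
    text = raw.strip()
    # Drop the outer quote pair if present.
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1]
    tokens = ast.literal_eval(text)
    if not isinstance(tokens, list):
        raise ValueError("expected a list literal")
    return tokens


# Hypothetical shortened value; real values are much longer.
sample = "\"['higher than', 'same as', 'what ', 'and ']\""
print(parse_allowed_tokens(sample))
```

Note that some tokens carry trailing spaces (for example `'what '`), so the tokens are returned without stripping.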

logical-forms

Size of downloaded dataset files: 15.97 MB
Size of the generated dataset: 24.25 MB
Total amount of disk used: 40.22 MB

An example of 'train' looks as follows.

{
    "decomposition": "return ground transportation ;return #1 which  is  available ;return #2 from  the pittsburgh airport ;return #3 to downtown ;return the cost of #4",
    "operators": "['select', 'filter', 'filter', 'filter', 'project']",
    "program": "some program",
    "question_id": "ATIS_dev_102",
    "question_text": "what ground transportation is available from the pittsburgh airport to downtown and how much does it cost ",
    "split": "dev"
}

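Data Fields

The data fields are the same among all splits.

QDMR

question_id: a string feature.
question_text: a string feature.
decomposition: a string feature.
operators: a string feature.
split: a string feature.

QDMR-high-level

question_id: a string feature.
question_text: a string feature.
decomposition: a string feature.
operators: a string feature.
split: a string feature.

QDMR-high-level-lexicon

source: a string feature.
allowed_tokens: a string feature.

QDMR-lexicon

source: a string feature.
allowed_tokens: a string feature.

logical-forms

question_id: a string feature.
question_text: a string feature.
decomposition: a string feature.
operators: a string feature.
split: a string feature.
program: a string feature.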

Data Splits

name	train	validation	test
QDMR	44321	7760	8069
QDMR-high-level	17503	3130	3195
QDMR-high-level-lexicon	17503	3130	3195
QDMR-lexicon	44321	7760	8069
logical-forms	44098	7719	8006
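As a quick consistency check on the table, the QDMR and QDMR-high-level rows together add up to the 83,978 examples quoted in the dataset summary, and each lexicon configuration has the same split sizes as its non-lexicon counterpart. The short script below reproduces that arithmetic. The counts are copied from the table; the variable names are illustrative only.

```python
# (train, validation, test) counts copied from the table above.
splits = {
    "QDMR": (44321, 7760, 8069),
    "QDMR-high-level": (17503, 3130, 3195),
    "QDMR-high-level-lexicon": (17503, 3130, 3195),
    "QDMR-lexicon": (44321, 7760, 8069),
    "logical-forms": (44098, 7719, 8006),
}

totals = {name: sum(counts) for name, counts in splits.items()}
for name, total in totals.items():
    print(f"{name}: {total}")

# QDMR plus QDMR-high-level matches the figure in the summary.
combined = totals["QDMR"] + totals["QDMR-high-level"]
print("QDMR + QDMR-high-level:", combined)  # 83978
```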

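Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed

Citation Information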

@article{Wolfson2020Break,
  title={Break It Down: A Question Understanding Benchmark},
  author={Wolfson, Tomer and Geva, Mor and Gupta, Ankit and Gardner, Matt and Goldberg, Yoav and Deutch, Daniel and Berant, Jonathan},
  journal={Transactions of the Association for Computational Linguistics},
  year={2020},
}

Contributions

Thanks to @patrickvonplaten, @lewtun, and @thomwolf for adding this dataset.

Author:

Anonymous

Dataset size:

32.84 KB