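Datasets:

break_data

Tasks:

Text2Text Generation

Languages:

en

Multilinguality:

monolingual

Size Categories:

10K<n<100K

Language Creators:

crowdsourced

Annotations Creators:

crowdsourced

Source Datasets:

original

Dataset Card for "break_data"

Dataset Summary

Break is a human-annotated dataset of natural language questions and their Question Decomposition Meaning Representations (QDMRs). It contains 83,978 examples drawn from 10 question-answering datasets over text, images, and databases. This repository contains the Break dataset along with information on the exact data format.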

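Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

QDMR
  • Size of downloaded dataset files: 15.97 MB
  • Size of the generated dataset: 15.93 MB
  • Total amount of disk used: 31.90 MB

An example of 'validation' looks as follows.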

{
    "decomposition": "return flights ;return #1 from  denver ;return #2 to philadelphia ;return #3 if  available",
    "operators": "['select', 'filter', 'filter', 'filter']",
    "question_id": "ATIS_dev_0",
    "question_text": "what flights are available tomorrow from denver to philadelphia ",
    "split": "dev"
}
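
Judging from the example above, the decomposition field packs all steps into one string separated by ";", and the operators field holds the string form of a Python list with one operator per step. The following is a minimal sketch of how such a record might be unpacked; the helper name parse_qdmr is hypothetical and not part of the dataset or of any library, and the separator convention is an assumption drawn from this single example.

```python
import ast


def parse_qdmr(record):
    """Return a list of (operator, step) pairs for one QDMR record.

    Hypothetical helper: assumes steps are separated by ';' and that
    'operators' is the string form of a Python list, as in the example.
    """
    # Split on ';' and collapse the irregular whitespace seen in the raw text.
    steps = [" ".join(s.split()) for s in record["decomposition"].split(";") if s.strip()]
    # literal_eval parses the list literal without executing arbitrary code.
    operators = ast.literal_eval(record["operators"])
    if len(steps) != len(operators):
        raise ValueError("number of steps and operators differ")
    return list(zip(operators, steps))


example = {
    "decomposition": "return flights ;return #1 from  denver ;return #2 to philadelphia ;return #3 if  available",
    "operators": "['select', 'filter', 'filter', 'filter']",
}

pairs = parse_qdmr(example)
for number, (operator, step) in enumerate(pairs, start=1):
    print(f"#{number} [{operator}] {step}")
```

Normalizing whitespace is a design choice made here for readability; keep the raw steps instead if exact string matching against the original annotation matters.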
QDMR-high-level
  • Size of downloaded dataset files: 15.97 MB
  • Size of the generated dataset: 6.54 MB
  • Total amount of disk used: 22.51 MB

An example of 'train' looks as follows.

{
    "decomposition": "return ground transportation ;return #1 which  is  available ;return #2 from  the pittsburgh airport ;return #3 to downtown ;return the cost of #4",
    "operators": "['select', 'filter', 'filter', 'filter', 'project']",
    "question_id": "ATIS_dev_102",
    "question_text": "what ground transportation is available from the pittsburgh airport to downtown and how much does it cost ",
    "split": "dev"
}
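
In the example above, each step refers to earlier steps with "#n" markers, so a decomposition can be read as a small dependency chain. The sketch below extracts those references; the function name step_references is hypothetical, and the check that a step only points at earlier steps is an assumption about well-formed decompositions rather than a documented guarantee.

```python
import re


def step_references(decomposition):
    """Map each 1-based step number to the step numbers it references."""
    steps = [s.strip() for s in decomposition.split(";") if s.strip()]
    refs = {}
    for number, step in enumerate(steps, start=1):
        targets = [int(n) for n in re.findall(r"#(\d+)", step)]
        # Assumption: a step may only point back at steps that already exist.
        if any(t < 1 or t >= number for t in targets):
            raise ValueError(f"step {number} has a forward or invalid reference")
        refs[number] = targets
    return refs


decomposition = (
    "return ground transportation ;return #1 which  is  available ;"
    "return #2 from  the pittsburgh airport ;return #3 to downtown ;"
    "return the cost of #4"
)

refs = step_references(decomposition)
print(refs)
```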
QDMR-high-level-lexicon
  • Size of downloaded dataset files: 15.97 MB
  • Size of the generated dataset: 31.64 MB
  • Total amount of disk used: 47.61 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "allowed_tokens": "\"['higher than', 'same as', 'what ', 'and ', 'than ', 'at most', 'he', 'distinct', 'House', 'two', 'at least', 'or ', 'date', 'o...",
    "source": "What office, also held by a member of the Maine House of Representatives, did James K. Polk hold before he was president?"
}
QDMR-lexicon
  • Size of downloaded dataset files: 15.97 MB
  • Size of the generated dataset: 77.19 MB
  • Total amount of disk used: 93.16 MB

An example of 'validation' looks as follows.

This example was too long and was cropped:

{
    "allowed_tokens": "\"['higher than', 'same as', 'what ', 'and ', 'than ', 'at most', 'distinct', 'two', 'at least', 'or ', 'date', 'on ', '@@14@@', ...",
    "source": "what flights are available tomorrow from denver to philadelphia "
}
logical-forms
  • Size of downloaded dataset files: 15.97 MB
  • Size of the generated dataset: 24.25 MB
  • Total amount of disk used: 40.22 MB

An example of 'train' looks as follows.

{
    "decomposition": "return ground transportation ;return #1 which  is  available ;return #2 from  the pittsburgh airport ;return #3 to downtown ;return the cost of #4",
    "operators": "['select', 'filter', 'filter', 'filter', 'project']",
    "program": "some program",
    "question_id": "ATIS_dev_102",
    "question_text": "what ground transportation is available from the pittsburgh airport to downtown and how much does it cost ",
    "split": "dev"
}

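Data Fields

The data fields are the same among all splits.

QDMR
  • question_id: a string feature.
  • question_text: a string feature.
  • decomposition: a string feature.
  • operators: a string feature.
  • split: a string feature.
QDMR-high-level
  • question_id: a string feature.
  • question_text: a string feature.
  • decomposition: a string feature.
  • operators: a string feature.
  • split: a string feature.
QDMR-high-level-lexicon
  • source: a string feature.
  • allowed_tokens: a string feature.
QDMR-lexicon
  • source: a string feature.
  • allowed_tokens: a string feature.
logical-forms
  • question_id: a string feature.
  • question_text: a string feature.
  • decomposition: a string feature.
  • operators: a string feature.
  • split: a string feature.
  • program: a string feature.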
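
Since every field is a string and the field sets differ per configuration, a record can be sanity-checked against the lists above. This is only an illustrative sketch: EXPECTED_FIELDS and check_record are hypothetical names, and the field sets are copied from this card rather than read from the dataset itself.

```python
# Field names per configuration, transcribed from the lists in this card.
_QDMR_FIELDS = {"question_id", "question_text", "decomposition", "operators", "split"}
_LEXICON_FIELDS = {"source", "allowed_tokens"}

EXPECTED_FIELDS = {
    "QDMR": _QDMR_FIELDS,
    "QDMR-high-level": _QDMR_FIELDS,
    "QDMR-high-level-lexicon": _LEXICON_FIELDS,
    "QDMR-lexicon": _LEXICON_FIELDS,
    "logical-forms": _QDMR_FIELDS | {"program"},
}


def check_record(config, record):
    """Return (missing, extra, non_string) field names for one record."""
    expected = EXPECTED_FIELDS[config]
    present = set(record)
    missing = sorted(expected - present)
    extra = sorted(present - expected)
    # Every field in this card is described as a string feature.
    non_string = sorted(k for k, v in record.items() if not isinstance(v, str))
    return missing, extra, non_string


example = {
    "decomposition": "return flights ;return #1 from  denver ;return #2 to philadelphia ;return #3 if  available",
    "operators": "['select', 'filter', 'filter', 'filter']",
    "question_id": "ATIS_dev_0",
    "question_text": "what flights are available tomorrow from denver to philadelphia ",
    "split": "dev",
}

print(check_record("QDMR", example))
print(check_record("logical-forms", example))
```

The second call reports "program" as missing, which reflects the one field that separates logical-forms records from QDMR records.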

Data Splits

name                     train  validation  test
QDMR                     44321        7760  8069
QDMR-high-level          17503        3130  3195
QDMR-high-level-lexicon  17503        3130  3195
QDMR-lexicon             44321        7760  8069
logical-forms            44098        7719  8006
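
The split sizes can be cross-checked against the 83,978 examples stated in the summary. The arithmetic below suggests that this figure is the combined size of the QDMR and QDMR-high-level configurations; that reading is an inference from the numbers in this card, not something the card states explicitly.

```python
# (train, validation, test) sizes transcribed from the table in this card.
SPLITS = {
    "QDMR": (44321, 7760, 8069),
    "QDMR-high-level": (17503, 3130, 3195),
    "QDMR-high-level-lexicon": (17503, 3130, 3195),
    "QDMR-lexicon": (44321, 7760, 8069),
    "logical-forms": (44098, 7719, 8006),
}

totals = {name: sum(sizes) for name, sizes in SPLITS.items()}
for name, total in totals.items():
    print(f"{name}: {total}")

# Inferred relationship: the two annotation configurations add up to 83,978.
combined = totals["QDMR"] + totals["QDMR-high-level"]
print("QDMR + QDMR-high-level:", combined)
```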

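Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed

Citation Information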

@article{Wolfson2020Break,
  title={Break It Down: A Question Understanding Benchmark},
  author={Wolfson, Tomer and Geva, Mor and Gupta, Ankit and Gardner, Matt and Goldberg, Yoav and Deutch, Daniel and Berant, Jonathan},
  journal={Transactions of the Association for Computational Linguistics},
  year={2020},
}

Contributions

Thanks to @patrickvonplaten, @lewtun, and @thomwolf for adding this dataset.