Dataset: break_data
Task: Text2Text Generation
Language: en
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: crowdsourced
Annotation creators: crowdsourced
Source datasets: original
License: unknown

Break is a human-annotated dataset of natural language questions and their Question Decomposition Meaning Representations (QDMRs). Break consists of 83,978 examples sampled from 10 question answering datasets over text, images and databases. This repository contains the Break dataset along with information on the exact data format.
QDMR
An example of 'validation' looks as follows.
{
  "decomposition": "return flights ;return #1 from denver ;return #2 to philadelphia ;return #3 if available",
  "operators": "['select', 'filter', 'filter', 'filter']",
  "question_id": "ATIS_dev_0",
  "question_text": "what flights are available tomorrow from denver to philadelphia ",
  "split": "dev"
}
QDMR-high-level
An example of 'train' looks as follows.
{
  "decomposition": "return ground transportation ;return #1 which is available ;return #2 from the pittsburgh airport ;return #3 to downtown ;return the cost of #4",
  "operators": "['select', 'filter', 'filter', 'filter', 'project']",
  "question_id": "ATIS_dev_102",
  "question_text": "what ground transportation is available from the pittsburgh airport to downtown and how much does it cost ",
  "split": "dev"
}
QDMR-high-level-lexicon
An example of 'train' looks as follows.
This example was too long and was cropped:
{
  "allowed_tokens": "\"['higher than', 'same as', 'what ', 'and ', 'than ', 'at most', 'he', 'distinct', 'House', 'two', 'at least', 'or ', 'date', 'o...",
  "source": "What office, also held by a member of the Maine House of Representatives, did James K. Polk hold before he was president?"
}
QDMR-lexicon
An example of 'validation' looks as follows.
This example was too long and was cropped:
{
  "allowed_tokens": "\"['higher than', 'same as', 'what ', 'and ', 'than ', 'at most', 'distinct', 'two', 'at least', 'or ', 'date', 'on ', '@@14@@', ...",
  "source": "what flights are available tomorrow from denver to philadelphia "
}
logical-forms
An example of 'train' looks as follows.
{
  "decomposition": "return ground transportation ;return #1 which is available ;return #2 from the pittsburgh airport ;return #3 to downtown ;return the cost of #4",
  "operators": "['select', 'filter', 'filter', 'filter', 'project']",
  "program": "some program",
  "question_id": "ATIS_dev_102",
  "question_text": "what ground transportation is available from the pittsburgh airport to downtown and how much does it cost ",
  "split": "dev"
}
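The "decomposition" and "operators" fields in the examples above are both stored as plain strings. The following is a minimal sketch of how they might be unpacked, assuming (from the examples only, not from a format specification) that steps are separated by ";", that "#n" refers back to step n, and that "operators" is a Python-style list literal. The helper names are hypothetical and not part of the dataset or any library.

```python
from __future__ import annotations

import ast
import re


def parse_decomposition(decomposition: str) -> list[str]:
    """Split a QDMR decomposition string into its ordered steps.

    Assumption: ';' only ever appears as the step separator.
    """
    return [step.strip() for step in decomposition.split(";") if step.strip()]


def parse_operators(operators: str) -> list[str]:
    """Turn the stringified list in the 'operators' field into a real list."""
    return list(ast.literal_eval(operators))


def step_references(step: str) -> list[int]:
    """Return the 1-based numbers of earlier steps that a step refers to."""
    return [int(n) for n in re.findall(r"#(\d+)", step)]


# Field values copied from the 'train' example shown above.
decomposition = (
    "return ground transportation ;return #1 which is available ;"
    "return #2 from the pittsburgh airport ;return #3 to downtown ;"
    "return the cost of #4"
)
operators = "['select', 'filter', 'filter', 'filter', 'project']"

steps = parse_decomposition(decomposition)
ops = parse_operators(operators)
refs = [step_references(step) for step in steps]

# Print one line per step: its number, operator, text and back-references.
for number, (op, step, ref) in enumerate(zip(ops, steps, refs), start=1):
    print(f"#{number} [{op}] {step} (uses: {ref})")
```

In this example each operator lines up with exactly one step, so the two parsed lists can be zipped together. Whether that holds for every record is not stated here and would need to be checked against the data.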
The data fields are the same among all splits.
name | train | validation | test |
---|---|---|---|
QDMR | 44321 | 7760 | 8069 |
QDMR-high-level | 17503 | 3130 | 3195 |
QDMR-high-level-lexicon | 17503 | 3130 | 3195 |
QDMR-lexicon | 44321 | 7760 | 8069 |
logical-forms | 44098 | 7719 | 8006 |
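As a quick sanity check on the split sizes, the short sketch below sums the rows of the table. The counts are copied from the table. The reading that the QDMR and QDMR-high-level configurations together account for the 83,978 examples mentioned in the description is an observation from the arithmetic, not something the card states.

```python
# Split sizes copied from the table: (train, validation, test).
splits = {
    "QDMR": (44321, 7760, 8069),
    "QDMR-high-level": (17503, 3130, 3195),
    "QDMR-high-level-lexicon": (17503, 3130, 3195),
    "QDMR-lexicon": (44321, 7760, 8069),
    "logical-forms": (44098, 7719, 8006),
}

# Total number of examples per configuration.
totals = {name: sum(counts) for name, counts in splits.items()}

# Combined size of the two decomposition configurations.
combined = totals["QDMR"] + totals["QDMR-high-level"]

for name, total in totals.items():
    print(f"{name}: {total}")
print(f"QDMR + QDMR-high-level: {combined}")
```

The combined figure comes to 60,150 + 23,828 = 83,978. Each lexicon configuration has the same split sizes as its corresponding decomposition configuration.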
@article{Wolfson2020Break,
  title={Break It Down: A Question Understanding Benchmark},
  author={Wolfson, Tomer and Geva, Mor and Gupta, Ankit and Gardner, Matt and Goldberg, Yoav and Deutch, Daniel and Berant, Jonathan},
  journal={Transactions of the Association for Computational Linguistics},
  year={2020},
}
Thanks to @patrickvonplaten, @lewtun, @thomwolf for adding this dataset.