You can find the main data card on the GEM website.
DART is an English dataset aggregating multiple other data-to-text datasets in a common triple-based format. The new format is completely flat, thus not requiring a model to learn hierarchical structures, while still retaining the full information.
You can load the dataset via:

    import datasets
    data = datasets.load_dataset('GEM/dart')

The data loader can be found here.
Website: n/a
Paper authors: Linyong Nan, Dragomir Radev, Rui Zhang, Amrit Rau, Abhinand Sivaprasad, Chiachun Hsieh, Xiangru Tang, Aadit Vyas, Neha Verma, Pranav Krishna, Yangxiaokang Liu, Nadia Irwanto, Jessica Pan, Faiaz Rahman, Ahmad Zaidi, Mutethia Mutuma, Yasin Tarabar, Ankit Gupta, Tao Yu, Yi Chern Tan, Xi Victoria Lin, Caiming Xiong, Richard Socher, Nazneen Fatema Rajani

    @inproceedings{nan-etal-2021-dart,
        title = "{DART}: Open-Domain Structured Data Record to Text Generation",
        author = "Nan, Linyong and Radev, Dragomir and Zhang, Rui and Rau, Amrit and Sivaprasad, Abhinand and Hsieh, Chiachun and Tang, Xiangru and Vyas, Aadit and Verma, Neha and Krishna, Pranav and Liu, Yangxiaokang and Irwanto, Nadia and Pan, Jessica and Rahman, Faiaz and Zaidi, Ahmad and Mutuma, Mutethia and Tarabar, Yasin and Gupta, Ankit and Yu, Tao and Tan, Yi Chern and Lin, Xi Victoria and Xiong, Caiming and Socher, Richard and Rajani, Nazneen Fatema",
        booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
        month = jun,
        year = "2021",
        address = "Online",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2021.naacl-main.37",
        doi = "10.18653/v1/2021.naacl-main.37",
        pages = "432--447",
        abstract = "We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs). Data-to-text annotations can be a costly process, especially when dealing with tables which are the major source of structured data and contain nontrivial structures. To this end, we propose a procedure of extracting semantic triples from tables that encodes their structures by exploiting the semantic dependencies among table headers and the table title. Our dataset construction framework effectively merged heterogeneous sources from open domain semantic parsing and spoken dialogue systems by utilizing techniques including tree ontology annotation, question-answer pair to declarative sentence conversion, and predicate unification, all with minimum post-editing. We present systematic evaluation on DART as well as new state-of-the-art results on WebNLG 2017 to show that DART (1) poses new challenges to existing data-to-text datasets and (2) facilitates out-of-domain generalization. Our data and code can be found at https://github.com/Yale-LILY/dart.",
    }

Contact Name: Dragomir Radev, Rui Zhang, Nazneen Rajani
Contact Email: {dragomir.radev, r.zhang}@yale.edu, {nazneen.rajani}@salesforce.com
Has a Leaderboard? yes
Leaderboard Link:
Leaderboard Details: Several state-of-the-art table-to-text models were evaluated on DART, such as BART (Lewis et al., 2020), Seq2Seq-Att (MELBOURNE), and End-to-End Transformer (Castro Ferreira et al., 2019). The leaderboard reports BLEU, METEOR, TER, MoverScore, BERTScore, and BLEURT scores.
no
Covered Dialects: The dataset is aggregated from multiple other datasets that use general US-American or British English without differentiating between dialects.
Covered Languages: English
Whose Language? The dataset is aggregated from multiple others that were crowdsourced on different platforms.
License: mit (MIT License)
Intended Use: The dataset aims to further research in natural language generation from semantic data.
Primary Task: Data-to-Text
Communicative Goal: The speaker is required to produce coherent sentences and construct a tree-structured ontology of the column headers.
academic, industry
Curation Organization(s): Yale University, Salesforce Research, Penn State University, The University of Hong Kong, MIT
Dataset Creators: Linyong Nan, Dragomir Radev, Rui Zhang, Amrit Rau, Abhinand Sivaprasad, Chiachun Hsieh, Xiangru Tang, Aadit Vyas, Neha Verma, Pranav Krishna, Yangxiaokang Liu, Nadia Irwanto, Jessica Pan, Faiaz Rahman, Ahmad Zaidi, Mutethia Mutuma, Yasin Tarabar, Ankit Gupta, Tao Yu, Yi Chern Tan, Xi Victoria Lin, Caiming Xiong, Richard Socher, Nazneen Fatema Rajani
Who added the Dataset to GEM? Miruna Clinciu contributed the original data card and Yacine Jernite wrote the initial data loader. Sebastian Gehrmann migrated the data card and the loader to the new format.
- tripleset: a list of tuples, each tuple has 3 items
- subtree_was_extended: a boolean variable (true or false)
- annotations: a list of dict, each with source and text keys.
  - source: a string mentioning the name of the source table.
  - text: a sentence string.
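The fields above map naturally onto a text-to-text setup. The following is a minimal sketch, not the data loader's or the authors' preprocessing: the `linearize` and `to_source_and_references` helpers and the ` : ` and ` | ` separators are hypothetical choices made for illustration, and the sample instance is the one shown on this card.

```python
from typing import Dict, List, Tuple


def linearize(tripleset: List[List[str]]) -> str:
    """Flatten (subject, predicate, object) triples into one input string.

    The separators are an assumption; any unambiguous delimiter would do.
    """
    return " | ".join(" : ".join(triple) for triple in tripleset)


def to_source_and_references(instance: Dict) -> Tuple[str, List[str]]:
    """Return the linearized input and all reference sentences of an instance."""
    source = linearize(instance["tripleset"])
    references = [annotation["text"] for annotation in instance["annotations"]]
    return source, references


# Example instance copied from this data card.
instance = {
    "tripleset": [
        ["Ben Mauk", "High school", "Kenton"],
        ["Ben Mauk", "College", "Wake Forest Cincinnati"],
    ],
    "subtree_was_extended": False,
    "annotations": [
        {
            "source": "WikiTableQuestions_lily",
            "text": "Ben Mauk, who attended Kenton High School, "
                    "attended Wake Forest Cincinnati for college.",
        }
    ],
}

source, references = to_source_and_references(instance)
print(source)
print(references)
```

Because an instance can carry several annotations, the sketch keeps every reference sentence rather than only the first one.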
Reason for Structure: The structure is meant to represent more complex structures than "flat" attribute-value pairs by encoding hierarchical relationships.
How were labels chosen? They are a combination of those from existing datasets and new annotations that take advantage of the hierarchical structure.
Example Instance:

    {
      "tripleset": [
        [ "Ben Mauk", "High school", "Kenton" ],
        [ "Ben Mauk", "College", "Wake Forest Cincinnati" ]
      ],
      "subtree_was_extended": false,
      "annotations": [
        {
          "source": "WikiTableQuestions_lily",
          "text": "Ben Mauk, who attended Kenton High School, attended Wake Forest Cincinnati for college."
        }
      ]
    }

Data Splits
| Input Unit | Examples | Vocab Size | Words per SR | Sents per SR | Tables |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| Triple Set | 82,191 | 33.2K | 21.6 | 1.5 | 5,623 |

| Train | Dev | Test |
| ------------- | ------------- | ------------- |
| 62,659 | 6,980 | 12,552 |

Statistics of DART decomposed by different collection methods. DART exhibits a great deal of topical variety in terms of the number of unique predicates, the number of unique triples, and the vocabulary size. These statistics are computed from DART v1.1.1; the number of unique predicates reported is post-unification (see Section 3.4). SR: Surface Realization (details in Tables 1 and 2).
Splitting Criteria: For WebNLG 2017 and Cleaned E2E, DART uses the original data splits. For the new annotations on WikiTableQuestions and WikiSQL, random splitting would make the train, dev, and test splits contain similar tables and similar <triple-set, sentence> examples. They are thus split based on Jaccard similarity such that no training example has a similarity of over 0.5 with any test example.
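To make the similarity criterion concrete, here is a minimal sketch of a Jaccard-based filter. It is not the authors' splitting code: the card does not say which features the similarity is computed over, so comparing sets of column headers is an assumption, and `jaccard`, `keep_dissimilar`, and the toy tables are hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty sets: treat as dissimilar
    return len(set_a & set_b) / len(union)


def keep_dissimilar(
    train: List[Dict],
    test: List[Dict],
    features: Callable[[Dict], Iterable[str]],
    threshold: float = 0.5,
) -> List[Dict]:
    """Keep training items whose similarity to every test item is at most `threshold`."""
    test_features = [set(features(item)) for item in test]
    return [
        item
        for item in train
        if all(jaccard(features(item), tf) <= threshold for tf in test_features)
    ]


# Toy tables described only by their column headers (assumed feature set).
test_tables = [{"headers": ["Name", "College", "Team"]}]
train_tables = [
    {"headers": ["Name", "College", "High school"]},  # similarity 2/4 = 0.5, kept
    {"headers": ["Name", "College"]},                 # similarity 2/3, dropped
    {"headers": ["Year", "Venue"]},                   # similarity 0, kept
]

kept = keep_dissimilar(train_tables, test_tables, lambda table: table["headers"])
print(kept)
```

The comparison uses `<=` so that a similarity of exactly 0.5 is still allowed, matching the "over 0.5" wording.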
DART is a large, open-domain structured DAta Record to Text generation corpus with high-quality sentence annotations, where each input is a set of entity-relation triples following a tree-structured ontology.
Similar Datasets: yes
Unique Language Coverage: no
Difference from other GEM datasets: The tree structure is unique among GEM datasets.
Ability that the Dataset measures: Reasoning, surface realization
no
Additional Splits? no
Experimental results on DART show that the BART model has the highest performance among the three models, with a BLEU score of 37.06. This is attributed to BART's generalization ability due to pretraining (Table 4).
Reasoning, surface realization
Metrics: BLEU, MoverScore, BERT-Score, BLEURT
Proposed Evaluation: The leaderboard uses the combination of BLEU, METEOR, TER, MoverScore, BERTScore, PARENT, and BLEURT to overcome the limitations of the n-gram overlap metrics. A small-scale human annotation of 100 data points was conducted along the dimensions of (1) fluency: a sentence is natural and grammatical, and (2) semantic faithfulness: a sentence is supported by the input triples.
Previous results available? yes
Other Evaluation Approaches: n/a
Relevant Previous Results: BART currently achieves the best performance according to the leaderboard.
Through DART, the dataset creators encourage further research in natural language generation from semantic data. DART provides high-quality sentence annotations with each input being a set of entity-relation triples in a tree structure.
Communicative Goal: The speaker is required to produce coherent sentences and construct a tree-structured ontology of the column headers.
Sourced from Different Sources: yes
Source Details: Found, Created for the dataset
Where was it found? Offline media collection
Creation Process: The creators proposed a two-stage annotation process for constructing triple-set sentence pairs based on a tree-structured ontology of each table. First, internal skilled annotators denote the parent column for each column header. Then, a larger number of annotators provide a sentential description of an automatically chosen subset of table cells in a row. To form a triple-set sentence pair, the highlighted cells can be converted to a connected triple set automatically according to the column ontology for the given table.
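The last step, converting highlighted cells into a connected triple set, might look roughly like the sketch below. It is an illustration under stated assumptions rather than the authors' procedure: the column ontology is modeled as a hypothetical `parent_of` mapping from each column header to its parent header, each triple is assumed to be (parent cell value, column header, cell value), and the "Name" column is invented to fit the example instance on this card.

```python
from typing import Dict, List, Optional


def cells_to_triples(
    row: Dict[str, str],
    parent_of: Dict[str, Optional[str]],
    highlighted: List[str],
) -> List[List[str]]:
    """Build a connected triple set for the highlighted columns of one row.

    For every highlighted column we also walk up the ontology and include
    its ancestors, so all triples are linked through shared cell values.
    Root columns (parent is None) only appear as subjects.
    """
    needed: List[str] = []
    for column in highlighted:
        current: Optional[str] = column
        while current is not None and parent_of.get(current) is not None:
            if current not in needed:
                needed.append(current)
            current = parent_of[current]
    return [[row[parent_of[column]], column, row[column]] for column in needed]


# Hypothetical table row and ontology; "Name" is the assumed root column.
row = {
    "Name": "Ben Mauk",
    "High school": "Kenton",
    "College": "Wake Forest Cincinnati",
}
parent_of = {"Name": None, "High school": "Name", "College": "Name"}

triples = cells_to_triples(row, parent_of, ["High school", "College"])
print(triples)
```

Including the ancestors of each highlighted column is what keeps the result connected when the ontology is more than one level deep.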
Language Producers: No further information about the MTurk workers has been provided.
Topics Covered: The sub-datasets are from Wikipedia, DBPedia, and artificially created restaurant data.
Data Validation: validated by crowdworker
Was Data Filtered? not filtered
none
Annotation Service? no
no
Justification for Using the Data: The new annotations are based on Wikipedia, which is in the public domain, and the other two datasets permit reuse (with attribution).
no PII
Justification for no PII: None of the datasets talk about individuals.
no
no
no
no
Are the Language Producers Representative of the Language? No, the annotators are raters on crowdworking platforms and thus only represent their demographics.
open license - commercial use allowed
Copyright Restrictions on the Language Data: open license - commercial use allowed
The dataset may contain some social biases, as the input sentences are based on Wikipedia (WikiTableQuestions, WikiSQL, WebNLG). Studies have shown that the English Wikipedia contains gender biases (Dinan et al., 2020), racial biases ([Papakyriakopoulos et al., 2020](https://dl.acm.org/doi/pdf/10.1145/3351095.3372843)), and geographical bias (Livingstone et al., 2010). More info.
Unsuited Applications: The end-to-end transformer has the lowest performance since the transformer model needs intermediate pipeline planning steps to have higher performance. Similar findings can be found in Castro Ferreira et al., 2019.