Datasets:

GEM/dart

Languages:

en

Multilinguality:

unknown

Language creators:

unknown

Annotations creators:

none

Source datasets:

original

License:

mit

Dataset Card for GEM/dart

Link to Main Data Card

You can find the main data card on the GEM Website.

Dataset Summary

DART is an English dataset that aggregates multiple other data-to-text datasets in a common triple-based format. The new format is completely flat, so a model does not need to learn hierarchical structures, while the full information is still retained.

You can load the dataset via:

import datasets
data = datasets.load_dataset('GEM/dart')

The data loader can be found here.

website

n/a

paper

ACL Anthology

authors

Linyong Nan, Dragomir Radev, Rui Zhang, Amrit Rau, Abhinand Sivaprasad, Chiachun Hsieh, Xiangru Tang, Aadit Vyas, Neha Verma, Pranav Krishna, Yangxiaokang Liu, Nadia Irwanto, Jessica Pan, Faiaz Rahman, Ahmad Zaidi, Mutethia Mutuma, Yasin Tarabar, Ankit Gupta, Tao Yu, Yi Chern Tan, Xi Victoria Lin, Caiming Xiong, Richard Socher, Nazneen Fatema Rajani

Dataset Overview

Where to find the Data and its Documentation

Download

Github

Paper

ACL Anthology

BibTex
@inproceedings{nan-etal-2021-dart,
    title = "{DART}: Open-Domain Structured Data Record to Text Generation",
    author = "Nan, Linyong  and
      Radev, Dragomir  and
      Zhang, Rui  and
      Rau, Amrit  and
      Sivaprasad, Abhinand  and
      Hsieh, Chiachun  and
      Tang, Xiangru  and
      Vyas, Aadit  and
      Verma, Neha  and
      Krishna, Pranav  and
      Liu, Yangxiaokang  and
      Irwanto, Nadia  and
      Pan, Jessica  and
      Rahman, Faiaz  and
      Zaidi, Ahmad  and
      Mutuma, Mutethia  and
      Tarabar, Yasin  and
      Gupta, Ankit  and
      Yu, Tao  and
      Tan, Yi Chern  and
      Lin, Xi Victoria  and
      Xiong, Caiming  and
      Socher, Richard  and
      Rajani, Nazneen Fatema",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.naacl-main.37",
    doi = "10.18653/v1/2021.naacl-main.37",
    pages = "432--447",
    abstract = "We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs). Data-to-text annotations can be a costly process, especially when dealing with tables which are the major source of structured data and contain nontrivial structures. To this end, we propose a procedure of extracting semantic triples from tables that encodes their structures by exploiting the semantic dependencies among table headers and the table title. Our dataset construction framework effectively merged heterogeneous sources from open domain semantic parsing and spoken dialogue systems by utilizing techniques including tree ontology annotation, question-answer pair to declarative sentence conversion, and predicate unification, all with minimum post-editing. We present systematic evaluation on DART as well as new state-of-the-art results on WebNLG 2017 to show that DART (1) poses new challenges to existing data-to-text datasets and (2) facilitates out-of-domain generalization. Our data and code can be found at https://github.com/Yale-LILY/dart.",
}
Contact Name

Dragomir Radev, Rui Zhang, Nazneen Rajani

Contact Email

{dragomir.radev, r.zhang}@yale.edu, {nazneen.rajani}@salesforce.com

Has a Leaderboard?

yes

Leaderboard Link

Leaderboard

Leaderboard Details

Several state-of-the-art table-to-text models were evaluated on DART, such as BART (Lewis et al., 2020), Seq2Seq-Att (MELBOURNE) and End-to-End Transformer (Castro Ferreira et al., 2019). The leaderboard reports BLEU, METEOR, TER, MoverScore, BERTScore and BLEURT scores.

Languages and Intended Use

Multilingual?

no

Covered Dialects

It is aggregated from multiple other datasets that use general US-American or British English without differentiation between dialects.

Covered Languages

English

Whose Language?

The dataset is aggregated from multiple others that were crowdsourced on different platforms.

License

mit: MIT License

Intended Use

The dataset aims to further research in natural language generation from semantic data.

Primary Task

Data-to-Text

Communicative Goal

The speaker is required to produce coherent sentences and construct a tree-structured ontology of the column headers.

Credit

Curation Organization Type(s)

academic, industry

Curation Organization(s)

Yale University, Salesforce Research, Penn State University, The University of Hong Kong, MIT

Dataset Creators

Linyong Nan, Dragomir Radev, Rui Zhang, Amrit Rau, Abhinand Sivaprasad, Chiachun Hsieh, Xiangru Tang, Aadit Vyas, Neha Verma, Pranav Krishna, Yangxiaokang Liu, Nadia Irwanto, Jessica Pan, Faiaz Rahman, Ahmad Zaidi, Mutethia Mutuma, Yasin Tarabar, Ankit Gupta, Tao Yu, Yi Chern Tan, Xi Victoria Lin, Caiming Xiong, Richard Socher, Nazneen Fatema Rajani

Who added the Dataset to GEM?

Miruna Clinciu contributed the original data card and Yacine Jernite wrote the initial data loader. Sebastian Gehrmann migrated the data card and the loader to the new format.

Dataset Structure

Data Fields

- tripleset: a list of tuples, each tuple has 3 items
- subtree_was_extended: a boolean variable (true or false)
- annotations: a list of dict, each with source and text keys.
  - source: a string mentioning the name of the source table.
  - text: a sentence string.

Reason for Structure

The structure is supposed to be able to represent more complex structures beyond "flat" attribute-value pairs, instead encoding hierarchical relationships.

How were labels chosen?

They are a combination of those from existing datasets and new annotations that take advantage of the hierarchical structure.

Example Instance
 {
    "tripleset": [
      [
        "Ben Mauk",
        "High school",
        "Kenton"
      ],
      [
        "Ben Mauk",
        "College",
        "Wake Forest Cincinnati"
      ]
    ],
    "subtree_was_extended": false,
    "annotations": [
      {
        "source": "WikiTableQuestions_lily",
        "text": "Ben Mauk, who attended Kenton High School, attended Wake Forest Cincinnati for college."
      }
    ]
  }
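As a rough illustration of how the fields above fit together, the sketch below flattens the example instance into one (input, target) pair per reference sentence. The " | " and " ; " separators and the helper names `linearize_tripleset` and `to_pairs` are assumptions made for this example; they are not part of DART or the GEM loader.

```python
from typing import Dict, List

# The example instance shown above, copied as a Python dict.
example = {
    "tripleset": [
        ["Ben Mauk", "High school", "Kenton"],
        ["Ben Mauk", "College", "Wake Forest Cincinnati"],
    ],
    "subtree_was_extended": False,
    "annotations": [
        {
            "source": "WikiTableQuestions_lily",
            "text": "Ben Mauk, who attended Kenton High School, "
                    "attended Wake Forest Cincinnati for college.",
        }
    ],
}


def linearize_tripleset(tripleset: List[List[str]]) -> str:
    """Join each (subject, predicate, object) triple into one flat string.

    The separators are illustrative choices, not something DART prescribes.
    """
    return " ; ".join(" | ".join(triple) for triple in tripleset)


def to_pairs(instance: Dict) -> List[Dict[str, str]]:
    """Build one (input, target) pair per reference annotation."""
    source_text = linearize_tripleset(instance["tripleset"])
    return [
        {"input": source_text, "target": annotation["text"]}
        for annotation in instance["annotations"]
    ]


pairs = to_pairs(example)
print(pairs[0]["input"])
```

Because an instance can carry several annotations, emitting one pair per annotation keeps every reference sentence available for training or multi-reference evaluation.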
Data Splits

| Input Unit | Examples | Vocab Size | Words per SR | Sents per SR | Tables |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| Triple Set | 82,191 | 33.2K | 21.6 | 1.5 | 5,623 |

| Train | Dev | Test |
| ------------- | ------------- | ------------- |
| 62,659 | 6,980 | 12,552 |

Statistics of DART decomposed by different collection methods. DART exhibits a great deal of topical variety in terms of the number of unique predicates, the number of unique triples, and the vocabulary size. These statistics are computed from DART v1.1.1; the number of unique predicates reported is post-unification (see Section 3.4). SR: Surface Realization. (Details in Tables 1 and 2.)

Splitting Criteria

For WebNLG 2017 and Cleaned E2E, DART uses the original data splits. For the new annotations on WikiTableQuestions and WikiSQL, random splitting would make the train, dev, and test splits contain similar tables and similar <triple-set, sentence> examples. They are thus split based on Jaccard similarity such that no training example has a similarity of over 0.5 with a test example.
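The similarity check can be sketched as follows. The card does not say which sets the Jaccard similarity is computed over, so comparing lowercased token sets is an assumption here, and the function names are hypothetical.

```python
from typing import List, Set


def token_set(text: str) -> Set[str]:
    """Lowercase, whitespace-split tokens (an assumed unit of comparison)."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |A & B| / |A | B|, defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def too_similar_to_test(candidate: str, test_examples: List[str],
                        threshold: float = 0.5) -> bool:
    """True if the candidate exceeds the threshold against any test example."""
    cand = token_set(candidate)
    return any(jaccard(cand, token_set(t)) > threshold for t in test_examples)


test_examples = ["ben mauk attended kenton high school"]
near = "ben mauk attended kenton high school in ohio"
far = "the restaurant serves cheap italian food"

print(too_similar_to_test(near, test_examples))  # 6 shared / 8 total = 0.75
print(too_similar_to_test(far, test_examples))   # no shared tokens = 0.0
```

A candidate flagged this way would be kept out of the training split, which is one way to read the 0.5 criterion stated above.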

Dataset in GEM

Rationale for Inclusion in GEM

Why is the Dataset in GEM?

DART is a large, open-domain structured DAta Record to Text generation corpus with high-quality sentence annotations, with each input being a set of entity-relation triples following a tree-structured ontology.

Similar Datasets

yes

Unique Language Coverage

no

Difference from other GEM datasets

The tree structure is unique among GEM datasets.

Ability that the Dataset measures

Reasoning, surface realization

GEM-Specific Curation

Modified for GEM?

no

Additional Splits?

no

Getting Started with the Task

Pointers to Resources

Experimental results on DART show that the BART model has the highest performance among the three models, with a BLEU score of 37.06. This is attributed to BART’s generalization ability due to pretraining (Table 4).

Previous Results

Previous Results

Measured Model Abilities

Reasoning, surface realization

Metrics

BLEU, MoverScore, BERT-Score, BLEURT

Proposed Evaluation

The leaderboard uses a combination of BLEU, METEOR, TER, MoverScore, BERTScore, PARENT and BLEURT to overcome the limitations of n-gram overlap metrics. A small-scale human annotation of 100 data points was conducted along the dimensions of (1) fluency - a sentence is natural and grammatical, and (2) semantic faithfulness - a sentence is supported by the input triples.
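To make the limitation of n-gram overlap concrete, the sketch below computes a clipped n-gram precision, the building block that BLEU-style metrics rely on. It is a simplified illustration, not BLEU itself (no brevity penalty, no averaging over n-gram orders) and not the leaderboard's scoring code; the sentences are made up for the example.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis: str, reference: str, n: int) -> float:
    """Fraction of hypothesis n-grams found in the reference, with counts clipped."""
    hyp = ngrams(hypothesis.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


reference = "ben mauk attended kenton high school"
paraphrase = "ben mauk went to kenton for high school"

print(clipped_precision(paraphrase, reference, 1))  # 5 of 8 unigrams match
print(clipped_precision(paraphrase, reference, 2))  # 2 of 7 bigrams match
```

The paraphrase is faithful to the same facts, yet its bigram precision is low, which is the kind of gap that embedding-based metrics and human ratings of fluency and faithfulness are meant to cover.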

Previous results available?

yes

Other Evaluation Approaches

n/a

Relevant Previous Results

BART currently achieves the best performance according to the leaderboard.

Dataset Curation

Original Curation

Original Curation Rationale

Through DART, the dataset creators encourage further research in natural language generation from semantic data. DART provides high-quality sentence annotations, with each input being a set of entity-relation triples in a tree structure.

Communicative Goal

The speaker is required to produce coherent sentences and construct a tree-structured ontology of the column headers.

Sourced from Different Sources

yes

Source Details

Language Data

How was Language Data Obtained?

Found, Created for the dataset

Where was it found?

Offline media collection

Creation Process

The creators proposed a two-stage annotation process for constructing triple-set/sentence pairs based on a tree-structured ontology of each table. First, skilled internal annotators denote the parent column for each column header. Then, a larger number of annotators provide a sentential description of an automatically chosen subset of table cells in a row. To form a triple-set/sentence pair, the highlighted cells can be converted automatically to a connected triple set according to the column ontology for the given table.
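The conversion from highlighted cells to a connected triple set might look roughly like the sketch below. It assumes the ontology is given as a mapping from each column header to its parent header (with `None` for the root) and that a triple has the form (parent cell, child header, child cell). The column name "Player", the mapping, and the function name are hypothetical; this is not the authors' implementation.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]


def cells_to_triples(row: Dict[str, str],
                     parent_of: Dict[str, Optional[str]],
                     highlighted: List[str]) -> List[Triple]:
    """Emit one triple per ontology edge on the path from the root to each
    highlighted column, so the resulting triple set stays connected."""
    triples: List[Triple] = []
    seen = set()
    for column in highlighted:
        path: List[Triple] = []
        current = column
        # Walk up the ontology until the root column is reached.
        while parent_of[current] is not None:
            parent = parent_of[current]
            path.append((row[parent], current, row[current]))
            current = parent
        # Add edges root-first, skipping any already emitted.
        for triple in reversed(path):
            if triple not in seen:
                seen.add(triple)
                triples.append(triple)
    return triples


# Hypothetical table row and column ontology for the Ben Mauk example.
row = {"Player": "Ben Mauk", "High school": "Kenton",
       "College": "Wake Forest Cincinnati"}
parent_of = {"Player": None, "High school": "Player", "College": "Player"}

triples = cells_to_triples(row, parent_of, ["High school", "College"])
print(triples)
```

Walking up to the root for every highlighted cell is what keeps the set connected even when an intermediate column was not itself highlighted.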

Language Producers

No further information about the MTurk workers has been provided.

Topics Covered

The sub-datasets are from Wikipedia, DBPedia, and artificially created restaurant data.

Data Validation

validated by crowdworker

Was Data Filtered?

not filtered

Structured Annotations

Additional Annotations?

none

Annotation Service?

no

Consent

Any Consent Policy?

no

Justification for Using the Data

The new annotations are based on Wikipedia, which is in the public domain, and the other two datasets permit reuse (with attribution).

Private Identifying Information (PII)

Contains PII?

no PII

Justification for no PII

None of the datasets talk about individuals.

Maintenance

Any Maintenance Plan?

no

Broader Social Context

Previous Work on the Social Impact of the Dataset

Usage of Models based on the Data

no

Impact on Under-Served Communities

Addresses needs of underserved Communities?

no

Discussion of Biases

Any Documented Social Biases?

no

Are the Language Producers Representative of the Language?

No, the annotators are raters on crowdworking platforms and thus only represent their demographics.

Considerations for Using the Data

PII Risks and Liability

Licenses

Copyright Restrictions on the Dataset

open license - commercial use allowed

Copyright Restrictions on the Language Data

open license - commercial use allowed

Known Technical Limitations

Technical Limitations

The dataset may contain some social biases, as the input sentences are based on Wikipedia (WikiTableQuestions, WikiSQL, WebNLG). Studies have shown that the English Wikipedia contains gender biases (Dinan et al., 2020), racial biases (Papakyriakopoulos et al., 2020; https://dl.acm.org/doi/pdf/10.1145/3351095.3372843) and geographical bias (Livingstone et al., 2010). More info.

Unsuited Applications

The end-to-end transformer has the lowest performance, since the transformer model needs intermediate pipeline planning steps to achieve higher performance. Similar findings are reported in Castro Ferreira et al., 2019.