Datasets:

GEM/e2e_nlg

Languages:

en

Multilinguality:

unknown

Language creators:

unknown

Annotation creators:

none

Source datasets:

original

Dataset Card for GEM/e2e_nlg

Link to Main Data Card

You can find the main data card on the GEM Website.

Dataset Summary

The E2E NLG dataset is an English benchmark dataset for data-to-text models that verbalize a set of 2-9 key-value attribute pairs in the restaurant domain. The version used for GEM is the cleaned E2E NLG dataset, which filters examples with hallucinations and outputs that don't fully cover all input attributes.

You can load the dataset via:

import datasets
data = datasets.load_dataset('GEM/e2e_nlg')

The data loader can be found here.

website

Website

paper

First data release, Detailed E2E Challenge writeup, Cleaned E2E version

authors

Jekaterina Novikova, Ondrej Dusek and Verena Rieser

Dataset Overview

Where to find the Data and its Documentation

Webpage

Website

Download

Github

Paper

First data release, Detailed E2E Challenge writeup, Cleaned E2E version

BibTex
@inproceedings{e2e_cleaned,
    address = {Tokyo, Japan},
    title = {Semantic {Noise} {Matters} for {Neural} {Natural} {Language} {Generation}},
    url = {https://www.aclweb.org/anthology/W19-8652/},
    booktitle = {Proceedings of the 12th {International} {Conference} on {Natural} {Language} {Generation} ({INLG} 2019)},
    author = {Dušek, Ondřej and Howcroft, David M and Rieser, Verena},
    year = {2019},
    pages = {421--426},
}

Contact Name

Ondrej Dusek

Contact Email

odusek@ufal.mff.cuni.cz

Has a Leaderboard?

no

Languages and Intended Use

Multilingual?

no

Covered Dialects

Dialect-specific data was not collected and the language is general British English.

Covered Languages

English

Whose Language?

The original dataset was collected on the CrowdFlower (now Appen) platform from self-reported native English speakers. No demographic information was provided, but the collection was geographically limited to English-speaking countries.

License

cc-by-sa-4.0: Creative Commons Attribution Share Alike 4.0 International

Intended Use

The dataset was collected to test neural models on a very well-specified realization task.

Primary Task

Data-to-Text

Communicative Goal

Producing a text informing/recommending a restaurant, given all and only the attributes specified on the input.

Credit

Curation Organization Type(s)

academic

Curation Organization(s)

Heriot-Watt University

Dataset Creators

Jekaterina Novikova, Ondrej Dusek and Verena Rieser

Funding

This research received funding from the EPSRC projects DILiGENt (EP/M005429/1) and MaDrIgAL (EP/N017536/1).

Who added the Dataset to GEM?

Simon Mille wrote the initial data card and Yacine Jernite the data loader. Sebastian Gehrmann migrated the data card to the v2 format and moved the data loader to the hub.

Dataset Structure

Data Fields

The data is in a CSV format, with the following fields:

  • mr -- the meaning representation (MR, input)
  • ref -- reference, i.e. the corresponding natural-language description (output)

There are additional fields (fixed, orig_mr) indicating whether the data was modified in the cleaning process and what the original MR was before cleaning, but these aren't used for NLG.

The MR has a flat structure -- attribute-value pairs are comma separated, with values enclosed in brackets (see the example instance below). There are 8 attributes:

  • name -- restaurant name
  • near -- a landmark close to the restaurant
  • area -- location (riverside, city centre)
  • food -- food type / cuisine (e.g. Japanese, Indian, English)
  • eatType -- restaurant type (restaurant, coffee shop, pub)
  • priceRange -- price range (low, medium, high, <£20, £20-30, >£30)
  • rating -- customer rating (low, medium, high, 1/5, 3/5, 5/5)
  • familyFriendly -- is the restaurant family-friendly (yes/no)

The same MR is often repeated multiple times with different synonymous references.
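
The flat MR format described above can be read with a short regular expression. The following is a minimal sketch, not part of the official data loader: the function name `parse_mr` and the pattern are illustrative assumptions, and the pattern assumes that values never contain a closing bracket.

```python
import re

# One "attribute[value]" pair. Assumption: values contain no "]" character.
PAIR_PATTERN = re.compile(r"\s*([^\[\],]+?)\s*\[([^\]]*)\]")


def parse_mr(mr: str) -> dict[str, str]:
    """Parse a flat MR string into an attribute -> value dict (input order kept)."""
    return {attr: value for attr, value in PAIR_PATTERN.findall(mr)}


example = "name[Alimentum], area[riverside], familyFriendly[yes], near[Burger King]"
for attribute, value in parse_mr(example).items():
    print(f"{attribute}: {value}")
```

Because a dict keeps insertion order, the parsed result preserves the order in which the attributes appear in the MR string.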

How were labels chosen?

The source MRs were generated automatically at random from a set of valid attribute values. The labels were crowdsourced and are natural-language texts.

Example Instance
{
  "input":  "name[Alimentum], area[riverside], familyFriendly[yes], near[Burger King]",
  "target": "Alimentum is a kids friendly place in the riverside area near Burger King." 
}

Data Splits

Split        MRs     Distinct MRs  References
Training     12,568  8,362         33,525
Development  1,484   1,132         4,299
Test         1,847   1,358         4,693
Total        15,899  10,852        42,517

“Distinct MRs” are MRs that remain distinct even if restaurant/place names (attributes name, near) are delexicalized, i.e., replaced with a placeholder.
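
As an illustration of delexicalization, the sketch below replaces the values of name and near with placeholders and counts how many MRs remain distinct. The placeholder format (`X-name`, `X-near`), the function name `delexicalize`, and the three toy MRs are assumptions made for this example; they are not taken from the dataset tooling.

```python
import re

# Attributes that hold restaurant/place names, as listed in the text above.
DELEX_ATTRIBUTES = ("name", "near")


def delexicalize(mr: str) -> str:
    """Replace the values of `name` and `near` with hypothetical placeholders."""
    for attr in DELEX_ATTRIBUTES:
        mr = re.sub(rf"\b{attr}\[[^\]]*\]", f"{attr}[X-{attr}]", mr)
    return mr


# Toy MRs, made up for illustration.
mrs = [
    "name[Alimentum], area[riverside], near[Burger King]",
    "name[Zizzi], area[riverside], near[Cafe Sicilia]",
    "name[Zizzi], area[city centre], near[Cafe Sicilia]",
]
distinct_mrs = {delexicalize(mr) for mr in mrs}
print(len(mrs), "MRs,", len(distinct_mrs), "distinct after delexicalization")
```

The first two toy MRs differ only in their names, so they collapse into a single delexicalized MR, while the third differs in area and stays separate.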

Splitting Criteria

The data are divided so that MRs in different splits do not overlap.
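
A non-overlap property of this kind can be checked with set intersections. The sketch below is an assumption-laden illustration: the helper `overlapping_mrs` and the toy splits are hypothetical, and it compares raw MR strings, since the text does not state whether the comparison is meant on raw or delexicalized MRs.

```python
from itertools import combinations


def overlapping_mrs(splits: dict[str, list[str]]) -> dict[tuple[str, str], set[str]]:
    """Return, for each pair of splits, the MRs that occur in both (if any)."""
    overlaps = {}
    for (name_a, mrs_a), (name_b, mrs_b) in combinations(splits.items(), 2):
        shared = set(mrs_a) & set(mrs_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Toy splits, made up for illustration.
toy_splits = {
    "train": ["name[Alimentum], area[riverside]", "name[Zizzi], food[Indian]"],
    "validation": ["name[Zizzi], eatType[pub]"],
    "test": ["name[Alimentum], area[city centre]"],
}
print(overlapping_mrs(toy_splits) or "no overlapping MRs")
```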

Dataset in GEM

Rationale for Inclusion in GEM

Why is the Dataset in GEM?

The E2E dataset is one of the largest limited-domain NLG datasets and is frequently used as a data-to-text generation benchmark. The E2E Challenge included 20 systems of very different architectures, with system outputs available for download.

Similar Datasets

yes

Unique Language Coverage

no

Difference from other GEM datasets

The dataset is much cleaner than comparable datasets, and it is also a relatively easy task, making for a straightforward evaluation.

Ability that the Dataset measures

Surface realization.

GEM-Specific Curation

Modified for GEM?

yes

Additional Splits?

yes

Split Information

4 special test sets for E2E were added to the GEM evaluation suite.

  • We created subsets of the training and development sets of ~500 randomly selected inputs each.
  • We applied input scrambling on a subset of 500 randomly selected test instances; the order of the input properties was randomly reassigned.
  • For the input size, we created subpopulations based on the number of restaurant properties in the input:

    Input length  Frequency (English)
    2             5
    3             120
    4             389
    5             737
    6             1187
    7             1406
    8             774
    9             73
    10            2
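
The scrambling and input-size transformations listed above can be sketched as follows. This is not the GEM implementation: the helper names, the fixed random seed, and the toy MRs are assumptions chosen for illustration.

```python
import random
import re
from collections import Counter


def split_pairs(mr: str) -> list[str]:
    """Split a flat MR into its "attribute[value]" pairs."""
    return [pair.strip() for pair in re.findall(r"[^,\[\]]+\[[^\]]*\]", mr)]


def scramble_mr(mr: str, rng: random.Random) -> str:
    """Return the MR with its attribute-value pairs in a random order."""
    pairs = split_pairs(mr)
    rng.shuffle(pairs)
    return ", ".join(pairs)


def input_length(mr: str) -> int:
    """Number of attribute-value pairs in the MR."""
    return len(split_pairs(mr))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
example = "name[Alimentum], area[riverside], familyFriendly[yes], near[Burger King]"
print(scramble_mr(example, rng))

# Group toy inputs into subpopulations by input length.
toy_inputs = [example, "name[Zizzi], food[Indian]", "name[Alimentum], eatType[pub]"]
print(Counter(input_length(mr) for mr in toy_inputs))
```

Scrambling changes only the order of the pairs, so the set of attribute-value pairs, and therefore the input length, is the same before and after.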
    Split Motivation

    Generalization and robustness

    Getting Started with the Task

    Previous Results

    Previous Results

    Measured Model Abilities

    Surface realization.

    Metrics

    BLEU, METEOR, ROUGE

    Proposed Evaluation

    The official evaluation script combines the MT-Eval and COCO Captioning libraries with the following metrics.

    • BLEU
    • CIDEr
    • NIST
    • METEOR
    • ROUGE-L
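
To make one of the listed metrics concrete, here is a simplified, single-reference sketch of ROUGE-L based on the longest common subsequence (LCS). It is not the official evaluation script: whitespace tokenization, lowercasing, and the default `beta = 1.0` are assumptions, and the official libraries may tokenize, weight, and combine multiple references differently.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    """LCS-based F-measure between one candidate and one reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


reference = "Alimentum is a kids friendly place in the riverside area near Burger King."
candidate = "Alimentum is a family friendly place near Burger King."
print(round(rouge_l(candidate, reference), 3))
```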
    Previous results available?

    yes

    Other Evaluation Approaches

    Most previous results, including the shared task results, used the library provided by the dataset creators. The shared task also conducted a human evaluation using the following two criteria:

    • Quality: When collecting quality ratings, system outputs were presented to crowd workers together with the corresponding meaning representation, which implies that correctness of the NL utterance relative to the MR should also influence this ranking. The crowd workers were asked: “How do you judge the overall quality of the utterance in terms of its grammatical correctness, fluency, adequacy and other important factors?”
    • Naturalness: When collecting naturalness ratings, system outputs were presented to crowd workers without the corresponding meaning representation. The crowd workers were asked: “Could the utterance have been produced by a native speaker?”

    Relevant Previous Results

    The shared task writeup has in-depth evaluations of systems (https://www.sciencedirect.com/science/article/pii/S0885230819300919).

    Dataset Curation

    Original Curation

    Original Curation Rationale

    The dataset was collected to showcase/test neural NLG models. It is larger and contains more lexical richness and syntactic variation than previous closed-domain NLG datasets.

    Communicative Goal

    Producing a text informing/recommending a restaurant, given all and only the attributes specified on the input.

    Sourced from Different Sources

    no

    Language Data

    How was Language Data Obtained?

    Crowdsourced

    Where was it crowdsourced?

    Other crowdworker platform

    Language Producers

    Human references describing the MRs were collected by crowdsourcing on the CrowdFlower (now Appen) platform, with either textual or pictorial MRs as a baseline. The pictorial MRs were used in 20% of cases -- these yield higher lexical variation but introduce noise.

    Topics Covered

    The dataset is focused on descriptions of restaurants.

    Data Validation

    validated by data curator

    Data Preprocessing

    There were basic checks (length, valid characters, repetition).

    Was Data Filtered?

    algorithmically

    Filter Criteria

    The cleaned version of the dataset used in GEM was algorithmically filtered. Its authors used regular expressions to match each human-generated reference with a more accurate input when attributes were hallucinated or dropped. Additionally, train-test overlap stemming from the transformation was removed. As a result, this data is much cleaner than the original dataset but not perfect (about 20% of instances may have misaligned slots, compared to 40% of the original data).
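
The general idea of regex-based slot matching can be illustrated with a toy check that reports which attributes of an MR are not verbalized in a reference. This is a rough sketch only: the two patterns below are hypothetical and hand-written for this example, they do not reproduce the patterns of the actual cleaning script, and they ignore complications such as negation.

```python
import re

# Hypothetical patterns for two attribute-value pairs only (assumption).
SLOT_PATTERNS = {
    ("familyFriendly", "yes"): re.compile(
        r"\b(family|kids?|child(ren)?)[ -]friendly\b", re.IGNORECASE
    ),
    ("area", "riverside"): re.compile(r"\b(riverside|river)\b", re.IGNORECASE),
}


def missing_slots(mr: dict[str, str], reference: str) -> list[str]:
    """Return attributes whose pattern exists but does not match the reference."""
    missing = []
    for attr, value in mr.items():
        pattern = SLOT_PATTERNS.get((attr, value))
        if pattern is not None and not pattern.search(reference):
            missing.append(attr)
    return missing


mr = {"name": "Alimentum", "area": "riverside", "familyFriendly": "yes"}
print(missing_slots(mr, "Alimentum is a kids friendly place in the riverside area."))
print(missing_slots(mr, "Alimentum is located by the river."))
```

In a cleaning setup of this kind, attributes reported as missing would be candidates for removal from the MR, so that the corrected input matches what the reference actually says.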

    Structured Annotations

    Additional Annotations?

    none

    Annotation Service?

    no

    Consent

    Any Consent Policy?

    yes

    Consent Policy Details

    Since a crowdsourcing platform was used, the involved raters waived their rights to the data and are aware that the produced annotations can be publicly released.

    Private Identifying Information (PII)

    Contains PII?

    no PII

    Justification for no PII

    The dataset is artificial and does not contain any description of people.

    Maintenance

    Any Maintenance Plan?

    no

    Broader Social Context

    Previous Work on the Social Impact of the Dataset

    Usage of Models based on the Data

    no

    Impact on Under-Served Communities

    Addresses needs of underserved Communities?

    no

    Discussion of Biases

    Any Documented Social Biases?

    no

    Are the Language Producers Representative of the Language?

    The source data is generated randomly, so it should not contain biases. The human references may be biased by the workers' demographics, but this was not investigated during data collection.

    Considerations for Using the Data

    PII Risks and Liability

    Licenses

    Copyright Restrictions on the Dataset

    open license - commercial use allowed

    Copyright Restrictions on the Language Data

    open license - commercial use allowed

    Known Technical Limitations

    Technical Limitations

    The cleaned version still has data points with hallucinated or omitted attributes.

    Unsuited Applications

    The data only pertains to the restaurant domain and the included attributes. A model cannot be expected to handle other domains or attributes.