Datasets:

totto

Languages:

en

Multilinguality:

monolingual

Size Categories:

100K<n<1M

Language Creators:

found

Annotations Creators:

expert-generated

Source Datasets:

original

ArXiv:

arxiv:2004.14373

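Dataset Card for ToTTo

Dataset Summary

ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples. It proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

[More Information Needed]

Dataset Structure

Data Instances

A sample from the training set is shown below: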

{'example_id': '1762238357686640028',
 'highlighted_cells': [[13, 2]],
 'id': 0,
 'overlap_subset': 'none',
 'sentence_annotations': {'final_sentence': ['A Favorita is the telenovela aired in the 9 pm timeslot.'],
  'original_sentence': ['It is also the first telenovela by the writer to air in the 9 pm timeslot.'],
  'sentence_after_ambiguity': ['A Favorita is the telenovela aired in the 9 pm timeslot.'],
  'sentence_after_deletion': ['It is the telenovela air in the 9 pm timeslot.']},
 'table': [[{'column_span': 1, 'is_header': True, 'row_span': 1, 'value': '#'},
   {'column_span': 1, 'is_header': True, 'row_span': 1, 'value': 'Run'},
   {'column_span': 1, 'is_header': True, 'row_span': 1, 'value': 'Title'},
   {'column_span': 1, 'is_header': True, 'row_span': 1, 'value': 'Chapters'},
   {'column_span': 1, 'is_header': True, 'row_span': 1, 'value': 'Author'},
   {'column_span': 1, 'is_header': True, 'row_span': 1, 'value': 'Director'},
   {'column_span': 1,
    'is_header': True,
    'row_span': 1,
    'value': 'Ibope Rating'}],
  [{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '59'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'June 5, 2000— February 2, 2001'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Laços de Família'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '209'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Manoel Carlos'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Ricardo Waddington'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '44.9'}],
  [{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '60'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'February 5, 2001— September 28, 2001'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Porto dos Milagres'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '203'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Aguinaldo Silva Ricardo Linhares'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Marcos Paulo Simões'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '44.6'}],
  [{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '61'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'October 1, 2001— June 14, 2002'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': 'O Clone'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '221'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Glória Perez'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Jayme Monjardim'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '47.0'}],
  [{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '62'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'June 17, 2002— February 14, 2003'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': 'Esperança'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '209'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Benedito Ruy Barbosa'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Luiz Fernando'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '37.7'}],
  [{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '63'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'February 17, 2003— October 10, 2003'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Mulheres Apaixonadas'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '203'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Manoel Carlos'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Ricardo Waddington'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '46.6'}],
  [{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '64'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'October 13, 2003— June 25, 2004'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Celebridade'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '221'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Gilberto Braga'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Dennis Carvalho'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '46.0'}],
  [{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '65'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'June 28, 2004— March 11, 2005'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Senhora do Destino'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '221'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Aguinaldo Silva'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': 'Wolf Maya'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '50.4'}],
  [{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '66'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'March 14, 2005— November 4, 2005'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': 'América'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '203'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Glória Perez'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Jayme Monjardim Marcos Schechtman'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '49.4'}],
  [{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '67'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'November 7, 2005— July 7, 2006'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': 'Belíssima'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '209'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Sílvio de Abreu'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Denise Saraceni'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '48.5'}],
  [{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '68'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'July 10, 2006— March 2, 2007'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Páginas da Vida'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '203'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Manoel Carlos'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Jayme Monjardim'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '46.8'}],
  [{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '69'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'March 5, 2007— September 28, 2007'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Paraíso Tropical'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '179'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Gilberto Braga Ricardo Linhares'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Dennis Carvalho'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '42.8'}],
  [{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '70'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'October 1, 2007— May 31, 2008'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Duas Caras'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '210'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Aguinaldo Silva'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': 'Wolf Maya'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '41.1'}],
  [{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '71'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'June 2, 2008— January 16, 2009'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'A Favorita'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '197'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'João Emanuel Carneiro'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Ricardo Waddington'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '39.5'}],
  [{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '72'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'January 19, 2009— September 11, 2009'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Caminho das Índias'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '203'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Glória Perez'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Marcos Schechtman'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '38.8'}],
  [{'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '73'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'September 14, 2009— May 14, 2010'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Viver a Vida'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '209'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Manoel Carlos'},
   {'column_span': 1,
    'is_header': False,
    'row_span': 1,
    'value': 'Jayme Monjardim'},
   {'column_span': 1, 'is_header': False, 'row_span': 1, 'value': '35.6'}]],
 'table_page_title': 'List of 8/9 PM telenovelas of Rede Globo',
 'table_section_text': '',
 'table_section_title': '2000s',
 'table_webpage_url': 'http://en.wikipedia.org/wiki/List_of_8/9_PM_telenovelas_of_Rede_Globo'}

Note that sentence annotations are not provided for the test set, so the values in sentence_annotations can be safely ignored for test examples.
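Because annotations are only meaningful outside the test split, code that reads targets may want to tolerate their absence. The helper below is a minimal sketch of that idea. The name reference_sentences is hypothetical, and since the card does not say how missing annotations are represented in the test split, the function simply treats a missing key, None, an empty list, and blank strings all as "no reference".

```python
from typing import Any, Dict, List


def reference_sentences(example: Dict[str, Any]) -> List[str]:
    """Return the non-empty final sentences of an example, or [] if there are none.

    Assumption: the test split may omit annotations in several ways
    (missing key, None, empty list, blank strings); all are treated alike.
    """
    annotations = example.get("sentence_annotations") or {}
    finals = annotations.get("final_sentence") or []
    return [s for s in finals if isinstance(s, str) and s.strip()]


# A training-style example (sentence copied from the sample above)
# and a test-style example with blank annotations.
train_like = {
    "sentence_annotations": {
        "final_sentence": ["A Favorita is the telenovela aired in the 9 pm timeslot."]
    }
}
test_like = {"sentence_annotations": {"final_sentence": [""]}}

print(reference_sentences(train_like))
print(reference_sentences(test_like))
```

Returning an empty list rather than raising keeps a single code path for all three splits; callers can skip examples whose list is empty.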

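Data Fields

  • table_webpage_url (str): URL of the web page that contains the table.
  • table_page_title (str): Table metadata giving context about the table; the title of the page.
  • table_section_title (str): Table metadata giving context about the table; the title of the section.
  • table_section_text (str): Table metadata giving context about the table; the text of the section (empty in the sample above).
  • table (List[List[Dict]]): The outer list holds the rows, and each inner list holds the cells of one row in column order. Each Dict has the following fields:
    • column_span (int)
    • is_header (bool)
    • row_span (int)
    • value (str)
  • highlighted_cells (List[[row_index, column_index]]): Each [row_index, column_index] pair indicates that table[row_index][column_index] is highlighted.
  • example_id (str): A unique ID for this example; the sample above stores it as a string.
  • sentence_annotations: Contains the original_sentence and the sequence of revised sentences, applied in order, that produce the final_sentence.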
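The indexing rule for highlighted_cells can be sketched in a few lines of Python. This is an illustrative helper, not part of the dataset's tooling: the names highlighted_values and _cell are hypothetical, and the sketch assumes that the first row holds the column headers and that no cell spans several rows or columns, as in the sample above.

```python
from typing import Any, Dict, List, Tuple

Cell = Dict[str, Any]
Table = List[List[Cell]]


def highlighted_values(example: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (column header, cell value) pairs for each highlighted cell.

    Assumes row 0 is the header row and every span is 1; tables with
    merged cells would need a proper grid reconstruction instead.
    """
    table: Table = example["table"]
    header_row = table[0]
    pairs = []
    for row_index, column_index in example["highlighted_cells"]:
        cell = table[row_index][column_index]
        header = header_row[column_index]
        # Fall back to an empty header if row 0 is not actually a header row.
        header_text = header["value"] if header["is_header"] else ""
        pairs.append((header_text, cell["value"]))
    return pairs


def _cell(value: str, is_header: bool = False) -> Cell:
    """Build a cell dict in the same shape as the entries of `table`."""
    return {"column_span": 1, "is_header": is_header, "row_span": 1, "value": value}


# Toy two-column table in the same shape as the `table` field.
toy_example = {
    "table": [
        [_cell("#", True), _cell("Title", True)],
        [_cell("70"), _cell("Duas Caras")],
        [_cell("71"), _cell("A Favorita")],
    ],
    "highlighted_cells": [[2, 1]],
}

print(highlighted_values(toy_example))  # [('Title', 'A Favorita')]
```

Applied to the full training sample, the pair [13, 2] selects row 13 (the entry numbered 71) and column 2 (Title), which is the cell 'A Favorita'.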

Data Splits

DatasetDict({
    train: Dataset({
        features: ['id', 'table_page_title', 'table_webpage_url', 'table_section_title', 'table_section_text', 'table', 'highlighted_cells', 'example_id', 'sentence_annotations', 'overlap_subset'],
        num_rows: 120761
    })
    validation: Dataset({
        features: ['id', 'table_page_title', 'table_webpage_url', 'table_section_title', 'table_section_text', 'table', 'highlighted_cells', 'example_id', 'sentence_annotations', 'overlap_subset'],
        num_rows: 7700
    })
    test: Dataset({
        features: ['id', 'table_page_title', 'table_webpage_url', 'table_section_title', 'table_section_text', 'table', 'highlighted_cells', 'example_id', 'sentence_annotations', 'overlap_subset'],
        num_rows: 7700
    })
})
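A summary like the one above is what the Hugging Face datasets library prints for a loaded DatasetDict. The sketch below shows one way to load the data and compare a local copy against these row counts. It assumes the datasets package is installed and that the dataset is available on the hub under the identifier "totto" (the name in this card's metadata; the hub identifier may differ). The helper names load_totto and split_mismatches are hypothetical.

```python
from typing import Dict, Mapping, Optional, Tuple

# Row counts copied from the split summary above.
EXPECTED_ROWS: Dict[str, int] = {"train": 120761, "validation": 7700, "test": 7700}


def load_totto():
    """Download ToTTo via the Hugging Face `datasets` library.

    Assumption: the dataset is published on the hub under the id "totto".
    """
    # Imported lazily so the rest of this module works without the package.
    from datasets import load_dataset

    return load_dataset("totto")


def split_mismatches(
    actual_rows: Mapping[str, int],
) -> Dict[str, Tuple[int, Optional[int]]]:
    """Return {split: (expected, actual)} for every split whose size differs.

    A split missing from `actual_rows` is reported with an actual size of None.
    """
    return {
        split: (expected, actual_rows.get(split))
        for split, expected in EXPECTED_ROWS.items()
        if actual_rows.get(split) != expected
    }
```

Calling split_mismatches({name: split.num_rows for name, split in load_totto().items()}) should return an empty dict when the downloaded copy matches the counts listed here.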

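Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

[More Information Needed]

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

[More Information Needed]

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information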

@inproceedings{parikh2020totto,
  title={{ToTTo}: A Controlled Table-To-Text Generation Dataset},
  author={Parikh, Ankur P and Wang, Xuezhi and Gehrmann, Sebastian and Faruqui, Manaal and Dhingra, Bhuwan and Yang, Diyi and Das, Dipanjan},
  booktitle={Proceedings of EMNLP},
  year={2020}
 }

Contributions

Thanks to @abhishekkrthakur for adding this dataset.