Datasets:

wiki_bio

Languages:

en

Multilinguality:

monolingual

Size Categories:

100K<n<1M

Language Creators:

found

Annotations Creators:

found

Source Datasets:

original

ArXiv:

arxiv:1603.07771

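Dataset Card for [Dataset Name]

Dataset Summary

This dataset contains 728,321 biographies extracted from Wikipedia. Each entry consists of the first paragraph of the biography and the tabular infobox.

Supported Tasks and Leaderboards

The main purpose of this dataset is to support the development of text generation models.

Languages

English.

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

A single sample has the following structure: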

{
   "input_text":{
      "context":"pope michael iii of alexandria\n",
      "table":{
         "column_header":[
            "type",
            "ended",
            "death_date",
            "title",
            "enthroned",
            "name",
            "buried",
            "religion",
            "predecessor",
            "nationality",
            "article_title",
            "feast_day",
            "birth_place",
            "residence",
            "successor"
         ],
         "content":[
            "pope",
            "16 march 907",
            "16 march 907",
            "56th of st. mark pope of alexandria & patriarch of the see",
            "25 april 880",
            "michael iii of alexandria",
            "monastery of saint macarius the great",
            "coptic orthodox christian",
            "shenouda i",
            "egyptian",
            "pope michael iii of alexandria\n",
            "16 -rrb- march -lrb- 20 baramhat in the coptic calendar",
            "egypt",
            "saint mark 's church",
            "gabriel i"
         ],
         "row_number":[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
      }
   },
   "target_text":"pope michael iii of alexandria -lrb- also known as khail iii -rrb- was the coptic pope of alexandria and patriarch of the see of st. mark -lrb- 880 -- 907 -rrb- .\nin 882 , the governor of egypt , ahmad ibn tulun , forced khail to pay heavy contributions , forcing him to sell a church and some attached properties to the local jewish community .\nthis building was at one time believed to have later become the site of the cairo geniza .\n"
}

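Here, the "table" field stores all of the information from the Wikipedia infobox: the infobox headers are stored in "column_header" and the corresponding values in "content".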
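Because "column_header" and "content" are parallel lists, pairing them position by position recovers the infobox as key/value pairs. The following is a minimal sketch of that pairing, not part of the dataset's tooling: the function name `infobox_to_dict` and the abbreviated `sample_table` are illustrative assumptions, and the sketch assumes header names are unique within one infobox (a repeated header would overwrite the earlier value).

```python
from typing import Dict, List


def infobox_to_dict(table: Dict[str, List]) -> Dict[str, str]:
    """Pair each infobox header with the content cell at the same position.

    Hypothetical helper for illustration; assumes unique header names.
    """
    headers = table["column_header"]
    contents = table["content"]
    if len(headers) != len(contents):
        raise ValueError("column_header and content must have the same length")
    # Strip surrounding whitespace, e.g. the trailing newline on "article_title".
    return {header: cell.strip() for header, cell in zip(headers, contents)}


# Abbreviated copy of the sample shown above (four of the fifteen fields).
sample_table = {
    "column_header": ["type", "death_date", "name", "article_title"],
    "content": [
        "pope",
        "16 march 907",
        "michael iii of alexandria",
        "pope michael iii of alexandria\n",
    ],
    "row_number": [1, 1, 1, 1],
}

infobox = infobox_to_dict(sample_table)
print(infobox["name"])
```

A dictionary makes individual fields easy to look up; if the original field order or possible duplicate headers matter for a given model, iterating over `zip(headers, contents)` directly may be the safer choice.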

Data Splits

  • Train: 582,659 samples.
  • Test: 72,831 samples.
  • Validation: 72,831 samples.
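As a quick sanity check, the three split sizes should add up to the 728,321 biographies mentioned in the summary, and they correspond to roughly an 80/10/10 partition. The short sketch below does that arithmetic; the dictionary keys are labels chosen here for illustration and are not necessarily the split names used by any loader.

```python
# Split sizes as listed above; the keys are illustrative labels only.
splits = {"train": 582_659, "test": 72_831, "val": 72_831}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

print(f"total: {total}")
for name, fraction in fractions.items():
    # Print each split's share of the whole dataset as a percentage.
    print(f"{name}: {fraction:.1%}")
```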

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

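This dataset was announced in the paper "Neural Text Generation from Structured Data with Application to the Biography Domain" (arXiv:1603.07771) and is stored in a repository owned by DavidGrangier.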

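Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information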

This dataset is distributed under the Creative Commons CC BY-SA 3.0 license.

Citation Information

To cite the original paper in BibTeX format:

@article{DBLP:journals/corr/LebretGA16,
  author    = {R{\'{e}}mi Lebret and
               David Grangier and
               Michael Auli},
  title     = {Generating Text from Structured Data with Application to the Biography
               Domain},
  journal   = {CoRR},
  volume    = {abs/1603.07771},
  year      = {2016},
  url       = {http://arxiv.org/abs/1603.07771},
  archivePrefix = {arXiv},
  eprint    = {1603.07771},
  timestamp = {Mon, 13 Aug 2018 16:48:30 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/LebretGA16.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Contributions

Thanks to @alejandrocros for adding this dataset.