Dataset:

ccdv/WCEP-10

English

WCEP10 dataset for summarization

Summarization dataset copied from PRIMERA

This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable:

"ccdv/WCEP-10": ("document", "summary")

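As a rough illustration of where that line goes, the sketch below shows a trimmed-down summarization_name_mapping dictionary with the new entry added. The two neighbouring entries are only a small excerpt for context, and the final lookup is an assumption about how the script resolves column names, not a copy of its code.

```python
# Sketch of the mapping in run_summarization.py: dataset name -> (text column, summary column).
# Only a short excerpt of the existing entries is shown here.
summarization_name_mapping = {
    "cnn_dailymail": ("article", "highlights"),
    "xsum": ("document", "summary"),
    # Line to add for this dataset:
    "ccdv/WCEP-10": ("document", "summary"),
}

# Assumed usage: look up the default column names for the chosen dataset.
text_column, summary_column = summarization_name_mapping["ccdv/WCEP-10"]
print(text_column, summary_column)
```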
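Configs

4 possible configs:

  • roberta: documents are concatenated with "</s>" (default)
  • newline: documents are concatenated with "\n"
  • bert: documents are concatenated with "[SEP]"
  • list: documents are returned as a list instead of a single string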
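
The effect of the four configs can be sketched with a small helper. The function join_documents and the SEPARATORS table are hypothetical names used only for illustration. The dataset's own loading script may handle whitespace around the separator differently.

```python
# Hypothetical helper mirroring the four configs described above.
SEPARATORS = {
    "roberta": "</s>",  # default config
    "newline": "\n",
    "bert": "[SEP]",
}


def join_documents(docs, config="roberta"):
    """Combine a cluster of documents according to the chosen config."""
    if config == "list":
        # The "list" config keeps the documents separate.
        return list(docs)
    if config not in SEPARATORS:
        raise ValueError(f"unknown config: {config!r}")
    return SEPARATORS[config].join(docs)


docs = ["First article.", "Second article."]
print(join_documents(docs))             # joined with "</s>"
print(join_documents(docs, "bert"))     # joined with "[SEP]"
print(join_documents(docs, "list"))     # returned as a list
```

With the datasets library, a config is normally selected by passing its name as the second argument, for example load_dataset("ccdv/WCEP-10", "newline"). Whether further arguments are needed depends on the installed version.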

Data Fields

  • id: paper id
  • document: a string/list containing the body of a set of documents
  • summary: a string containing the summary
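
As a minimal sketch of the record shape described above, the check below accepts a dictionary whose document field is either a single string or a list of strings. The function name has_expected_fields and the sample records are made up for illustration. The type of id is not stated here, so only its presence is checked.

```python
def has_expected_fields(record):
    """Return True if a record has id, document and summary in the expected shape."""
    if not {"id", "document", "summary"} <= set(record):
        return False
    doc = record["document"]
    # "document" is a string for the joined configs and a list for the "list" config.
    doc_ok = isinstance(doc, str) or (
        isinstance(doc, list) and all(isinstance(d, str) for d in doc)
    )
    return doc_ok and isinstance(record["summary"], str)


# Made-up records, not taken from the dataset.
joined = {"id": "0", "document": "Article one.</s>Article two.", "summary": "A summary."}
as_list = {"id": "0", "document": ["Article one.", "Article two."], "summary": "A summary."}
print(has_expected_fields(joined), has_expected_fields(as_list))
```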

Data Splits

This dataset has 3 splits: train, validation, and test.

Dataset Split    Number of Instances
Train            8158
Validation       1020
Test             1022
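
The split sizes add up to 10,200 instances, roughly an 80/10/10 division. The short sketch below recomputes the total and each split's share from the numbers in the table. The variable names are illustrative only.

```python
# Split sizes as listed in the table above.
split_sizes = {"Train": 8158, "Validation": 1020, "Test": 1022}

total = sum(split_sizes.values())
shares = {name: count / total for name, count in split_sizes.items()}

for name, count in split_sizes.items():
    print(f"{name:<12}{count:>6}  {shares[name]:.1%}")
print(f"{'Total':<12}{total:>6}")
```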

Cite original article

@article{DBLP:journals/corr/abs-2005-10070,
    author    = {Demian Gholipour Ghalandari and
                Chris Hokamp and
                Nghia The Pham and
                John Glover and
                Georgiana Ifrim},
    title     = {A Large-Scale Multi-Document Summarization Dataset from the Wikipedia
                Current Events Portal},
    journal   = {CoRR},
    volume    = {abs/2005.10070},
    year      = {2020},
    url       = {https://arxiv.org/abs/2005.10070},
    eprinttype = {arXiv},
    eprint    = {2005.10070},
    timestamp = {Fri, 22 May 2020 16:21:28 +0200},
    biburl    = {https://dblp.org/rec/journals/corr/abs-2005-10070.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

@article{DBLP:journals/corr/abs-2110-08499,
    author    = {Wen Xiao and
                Iz Beltagy and
                Giuseppe Carenini and
                Arman Cohan},
    title     = {{PRIMER:} Pyramid-based Masked Sentence Pre-training for Multi-document
                Summarization},
    journal   = {CoRR},
    volume    = {abs/2110.08499},
    year      = {2021},
    url       = {https://arxiv.org/abs/2110.08499},
    eprinttype = {arXiv},
    eprint    = {2110.08499},
    timestamp = {Fri, 22 Oct 2021 13:33:09 +0200},
    biburl    = {https://dblp.org/rec/journals/corr/abs-2110-08499.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}