Dataset:
ccdv/WCEP-10
Summarization dataset copied from PRIMERA.
This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable:
"ccdv/WCEP-10": ("document", "summary")
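As a rough illustration of what that mapping entry does, the sketch below mimics a column lookup of the kind run_summarization.py performs. Only the `"ccdv/WCEP-10"` entry comes from this card; the helper name `resolve_columns` and the fallback to the first two columns are assumptions made for the example, not a copy of the script.

```python
# Minimal sketch, assuming the script looks up (text_column, summary_column)
# by dataset name. Only the "ccdv/WCEP-10" entry is taken from this card.
summarization_name_mapping = {
    "ccdv/WCEP-10": ("document", "summary"),
}


def resolve_columns(dataset_name, column_names):
    """Return (text_column, summary_column) for a dataset.

    Hypothetical helper: falls back to the first two columns when the
    dataset name has no entry in the mapping (an assumed fallback).
    """
    dataset_columns = summarization_name_mapping.get(dataset_name)
    if dataset_columns is None:
        return column_names[0], column_names[1]
    return dataset_columns


if __name__ == "__main__":
    text_column, summary_column = resolve_columns(
        "ccdv/WCEP-10", ["document", "summary"]
    )
    print(text_column, summary_column)
```

Without the entry, a script of this shape would have to guess the columns from their order, which is why registering the dataset name explicitly is the safer choice.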
4 possible configurations:
This dataset has 3 splits: train, validation, and test.
Dataset Split | Number of Instances |
---|---|
Train | 8158 |
Validation | 1020 |
Test | 1022 |
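
The split sizes in the table sum to 10,200 instances, roughly an 80/10/10 division. A small sanity check along these lines can confirm that a loaded copy matches the table. The split keys `train`, `validation`, and `test` and the helper names are assumptions for illustration; any mapping from split name to a sized object (for example, the result of loading the dataset with the `datasets` library) should work.

```python
# Split sizes as listed in the table above.
EXPECTED_SPLIT_SIZES = {"train": 8158, "validation": 1020, "test": 1022}


def split_fractions(sizes):
    """Return each split's share of the total number of instances."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def check_split_sizes(splits, expected=EXPECTED_SPLIT_SIZES):
    """Return {split: (actual, expected)} for every split that does not match.

    `splits` is any mapping from split name to an object supporting len().
    A missing split is reported with an actual size of None.
    """
    mismatches = {}
    for name, expected_size in expected.items():
        actual = len(splits[name]) if name in splits else None
        if actual != expected_size:
            mismatches[name] = (actual, expected_size)
    return mismatches


if __name__ == "__main__":
    for name, fraction in split_fractions(EXPECTED_SPLIT_SIZES).items():
        print(f"{name}: {fraction:.1%}")
```

An empty result from `check_split_sizes` means every split has the size given in the table.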
@article{DBLP:journals/corr/abs-2005-10070,
  author     = {Demian Gholipour Ghalandari and
                Chris Hokamp and
                Nghia The Pham and
                John Glover and
                Georgiana Ifrim},
  title      = {A Large-Scale Multi-Document Summarization Dataset from the Wikipedia
                Current Events Portal},
  journal    = {CoRR},
  volume     = {abs/2005.10070},
  year       = {2020},
  url        = {https://arxiv.org/abs/2005.10070},
  eprinttype = {arXiv},
  eprint     = {2005.10070},
  timestamp  = {Fri, 22 May 2020 16:21:28 +0200},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2005-10070.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}

@article{DBLP:journals/corr/abs-2110-08499,
  author     = {Wen Xiao and
                Iz Beltagy and
                Giuseppe Carenini and
                Arman Cohan},
  title      = {{PRIMER:} Pyramid-based Masked Sentence Pre-training for Multi-document
                Summarization},
  journal    = {CoRR},
  volume     = {abs/2110.08499},
  year       = {2021},
  url        = {https://arxiv.org/abs/2110.08499},
  eprinttype = {arXiv},
  eprint     = {2110.08499},
  timestamp  = {Fri, 22 Oct 2021 13:33:09 +0200},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2110-08499.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}