Dataset:

griffin/ChemSum

English

Dataset Card for ChemSum

ChemSum Description

ChemSum Summary

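We introduce ChemSum, a dataset with a pure chemistry focus, compiled from the open-access articles of chemistry journals. For each journal, we downloaded full-text article PDFs from the open-access portion of the journal, either through available APIs or by scraping with Selenium Chrome WebDriver. Each PDF was then processed with a locally installed client to extract free-text paragraphs together with their sections.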

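The table below lists the journals from which open-access articles were obtained, along with the number of papers processed.

For all journals, when papers from other disciplines are also available (e.g. PubMed), we select only papers with a chemistry subject.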

| Source | # of Articles |
| --- | --- |
| Beilstein | 1,829 |
| Chem Cell | 546 |
| ChemRxiv | 12,231 |
| Chemistry Open | 398 |
| Nature Communications Chemistry | 572 |
| PubMed Author Manuscript | 57,680 |
| PubMed Open Access | 29,540 |
| Royal Society of Chemistry (RSC) | 9,334 |
| Scientific Reports - Nature | 6,826 |

Languages

English

Dataset Structure

Data Fields

| Column | Description |
| --- | --- |
| uuid | Unique identifier for the example |
| title | Title of the article |
| article_source | Open Source Journal (see above for list) |
| abstract | Abstract (summary reference) |
| sections | Full-text sections from the main body of the paper (<!> indicates section boundaries) |
| headers | Corresponding section headers for the sections field (<!> delimited) |
| source_toks | Aggregate number of tokens across sections |
| target_toks | Number of tokens in the abstract |
| compression | Ratio of source_toks to target_toks |
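
As a small illustration of the compression column, the sketch below computes the ratio of source_toks to target_toks for a single record. The record values are invented for the example and do not come from the dataset, and the helper name compression_ratio is hypothetical. The tokenizer used to produce the token counts is not specified here, so the sketch takes the counts as given.

```python
def compression_ratio(source_toks: int, target_toks: int) -> float:
    """Return source_toks / target_toks, as the compression column is described."""
    if target_toks <= 0:
        raise ValueError("target_toks must be positive")
    return source_toks / target_toks


# Hypothetical record: these token counts are invented for illustration.
example = {"source_toks": 5000, "target_toks": 200}

ratio = compression_ratio(example["source_toks"], example["target_toks"])
print(ratio)  # 25.0
```

A higher value means the abstract is short relative to the body of the paper.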

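See the load_chemistry() function in https://github.com/griff4692/calibrating-summaries/blob/master/preprocess/preprocess.py for the preprocessing used to turn this data into a summarization dataset. The inputs are sections and headers, and the target is abstract.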
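
The following is a minimal sketch of how the <!>-delimited headers and sections fields might be paired and joined into a single input string, with abstract as the target. It is not the load_chemistry() implementation. The helper names pair_sections and build_example, the output keys source and target, and the choice to join each header to its section with newlines are all assumptions made for illustration. Whitespace around the delimiter is stripped because the exact spacing in the raw fields is not documented here.

```python
from typing import Dict, List, Tuple

DELIM = "<!>"  # section boundary marker described in the data fields table


def pair_sections(headers: str, sections: str) -> List[Tuple[str, str]]:
    """Split both fields on the delimiter and pair each header with its section."""
    header_list = [h.strip() for h in headers.split(DELIM)]
    section_list = [s.strip() for s in sections.split(DELIM)]
    if len(header_list) != len(section_list):
        raise ValueError(
            f"{len(header_list)} headers but {len(section_list)} sections"
        )
    return list(zip(header_list, section_list))


def build_example(record: Dict[str, str]) -> Dict[str, str]:
    """Assemble a (source, target) pair; the output layout is an assumption."""
    pairs = pair_sections(record["headers"], record["sections"])
    source = "\n\n".join(f"{header}\n{body}" for header, body in pairs)
    return {"source": source, "target": record["abstract"]}


# Hypothetical record, invented for illustration only.
record = {
    "headers": "Introduction<!>Methods",
    "sections": "Intro text.<!>Method text.",
    "abstract": "Short summary.",
}
example = build_example(record)
print(example["source"])
```

Raising an error on mismatched counts is a deliberate choice in this sketch: silently truncating with zip would hide records whose headers and sections are out of step.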

Data Splits

| Split | Count |
| --- | --- |
| train | 115,956 |
| validation | 1,000 |
| test | 2,000 |
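
The per-source article counts and the split counts listed on this card can be cross-checked with a few lines of arithmetic. The sketch below copies the numbers from the two tables and confirms that both add up to the same total; the variable names are arbitrary.

```python
# Article counts per source, copied from the source table on this card.
source_counts = {
    "Beilstein": 1_829,
    "Chem Cell": 546,
    "ChemRxiv": 12_231,
    "Chemistry Open": 398,
    "Nature Communications Chemistry": 572,
    "PubMed Author Manuscript": 57_680,
    "PubMed Open Access": 29_540,
    "Royal Society of Chemistry (RSC)": 9_334,
    "Scientific Reports - Nature": 6_826,
}

# Example counts per split, copied from the split table on this card.
split_counts = {"train": 115_956, "validation": 1_000, "test": 2_000}

total_by_source = sum(source_counts.values())
total_by_split = sum(split_counts.values())

# Both tables should describe the same set of articles.
assert total_by_source == total_by_split
print(total_by_split)  # 118956

# Share of the dataset in each split.
for name, count in split_counts.items():
    print(f"{name}: {count / total_by_split:.1%}")
```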

Citation Information

@article{adams2023desired,
  title={What are the Desired Characteristics of Calibration Sets? Identifying Correlates on Long Form Scientific Summarization},
  author={Adams, Griffin and Nguyen, Bichlien H and Smith, Jake and Xia, Yingce and Xie, Shufang and Ostropolets, Anna and Deb, Budhaditya and Chen, Yuan-Jyue and Naumann, Tristan and Elhadad, No{\'e}mie},
  journal={arXiv preprint arXiv:2305.07615},
  year={2023}
}