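We introduce a dataset focused purely on chemistry. For each journal, we downloaded full-text article PDFs from the journal's open-access portion, either through an available API or by web scraping with the Selenium Chrome WebDriver. We then processed each PDF with a locally installed client to extract plain-text paragraphs along with their sections.
The table below lists the journals from which open-access articles were obtained, along with the number of papers processed.
For all journals, when papers from other disciplines were also available (e.g., PubMed), we selected only papers with chemistry topics.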
Source | # of Articles |
---|---|
Beilstein | 1,829 |
Chem Cell | 546 |
ChemRxiv | 12,231 |
Chemistry Open | 398 |
Nature Communications Chemistry | 572 |
PubMed Author Manuscript | 57,680 |
PubMed Open Access | 29,540 |
Royal Society of Chemistry (RSC) | 9,334 |
Scientific Reports - Nature | 6,826 |
English
Column | Description |
---|---|
uuid | Unique Identifier for the Example |
title | Title of the Article |
article_source | Open Source Journal (see above for list) |
abstract | Abstract (summary reference) |
sections | Full-text sections from the main body of paper (<!> indicates section boundaries) |
headers | Corresponding section headers for sections field (<!> delimited) |
source_toks | Aggregate number of tokens across sections |
target_toks | Number of tokens in the abstract |
compression | Ratio of source_toks to target_toks |
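
As a minimal sketch of how the delimited fields might be read, the snippet below splits `sections` and `headers` on the `<!>` marker and recomputes `compression`. It assumes each record is a plain dictionary of strings and numbers keyed by the column names above, and that headers and sections align one-to-one; the helper names `split_record` and `compression_ratio` are hypothetical and not part of the dataset.

```python
DELIMITER = "<!>"  # section boundary marker described in the table above


def split_record(record):
    """Pair each section header with its section text.

    Assumes `headers` and `sections` are strings delimited by `<!>`
    and that the two fields contain the same number of parts.
    """
    headers = [h.strip() for h in record["headers"].split(DELIMITER)]
    sections = [s.strip() for s in record["sections"].split(DELIMITER)]
    if len(headers) != len(sections):
        raise ValueError(
            f"{len(headers)} headers but {len(sections)} sections"
        )
    return list(zip(headers, sections))


def compression_ratio(record):
    """Ratio of source_toks to target_toks, as defined for `compression`."""
    return record["source_toks"] / record["target_toks"]


# Toy record for illustration only; the values are made up.
example = {
    "headers": "Introduction<!>Methods<!>Results",
    "sections": "Intro text.<!>Methods text.<!>Results text.",
    "source_toks": 1200,
    "target_toks": 150,
}

pairs = split_record(example)
ratio = compression_ratio(example)
```

Stripping whitespace around each part is a defensive choice here, since the table does not say whether the delimiter is padded with spaces.
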
See the load_chemistry() function in https://github.com/griff4692/calibrating-summaries/blob/master/preprocess/preprocess.py for preprocessing the data as a summarization dataset. The inputs are sections and headers, and the target is abstract.
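
The input/target layout can be illustrated with a short sketch. This is not the `load_chemistry()` implementation from the linked repository; it is a hypothetical helper, `to_summarization_pair`, that assumes one simple formatting choice (each header on its own line above its section text) and the `<!>` delimiter used by the `sections` and `headers` fields.

```python
def to_summarization_pair(record, delimiter="<!>"):
    """Build a (source, target) pair from one record.

    Hypothetical formatting: header, newline, section text, with a
    blank line between sections. The linked preprocessing script may
    format its inputs differently.
    """
    headers = record["headers"].split(delimiter)
    sections = record["sections"].split(delimiter)
    parts = [
        f"{header.strip()}\n{body.strip()}"
        for header, body in zip(headers, sections)
    ]
    return {
        "source": "\n\n".join(parts),
        "target": record["abstract"].strip(),
    }


# Toy record for illustration only.
record = {
    "headers": "Introduction<!>Methods",
    "sections": "A.<!>B.",
    "abstract": " Summary. ",
}
pair = to_summarization_pair(record)
```
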
Split | Count |
---|---|
train | 115,956 |
validation | 1,000 |
test | 2,000 |
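
As a quick consistency check, the per-source article counts and the split counts listed in the tables can be summed and compared. The sketch below copies the numbers from the two tables; the variable names are illustrative.

```python
# Article counts per source, copied from the source table.
articles_per_source = {
    "Beilstein": 1829,
    "Chem Cell": 546,
    "ChemRxiv": 12231,
    "Chemistry Open": 398,
    "Nature Communications Chemistry": 572,
    "PubMed Author Manuscript": 57680,
    "PubMed Open Access": 29540,
    "Royal Society of Chemistry (RSC)": 9334,
    "Scientific Reports - Nature": 6826,
}

# Example counts per split, copied from the split table.
split_counts = {"train": 115956, "validation": 1000, "test": 2000}

total_articles = sum(articles_per_source.values())
total_examples = sum(split_counts.values())

# Fraction of all examples that falls into each split.
split_share = {
    name: count / total_examples for name, count in split_counts.items()
}
```

Both totals come to 118,956, so every processed article is assigned to exactly one split.
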
@article{adams2023desired,
  title={What are the Desired Characteristics of Calibration Sets? Identifying Correlates on Long Form Scientific Summarization},
  author={Adams, Griffin and Nguyen, Bichlien H and Smith, Jake and Xia, Yingce and Xie, Shufang and Ostropolets, Anna and Deb, Budhaditya and Chen, Yuan-Jyue and Naumann, Tristan and Elhadad, No{\'e}mie},
  journal={arXiv preprint arXiv:2305.07615},
  year={2023}
}