Dataset:

griffin/ChemSum

Dataset Card for ChemSum

ChemSum Description

ChemSum Summary

We introduce a dataset with a pure chemistry focus, built by compiling a list of chemistry academic journals that publish Open-Access articles. For each journal, we downloaded full-text article PDFs from its Open-Access portion, either through available APIs or by scraping the content with the Selenium Chrome WebDriver. Each PDF was then processed with Grobid, via a locally installed client, to extract free-text paragraphs together with their sections.
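As a rough illustration of the extraction step, the sketch below pulls section headers and paragraphs out of a TEI XML document of the kind Grobid emits. It is not the authors' pipeline: the `SAMPLE_TEI` fragment is hand-written for this example, the helper name `extract_sections` is hypothetical, and the code assumes each section is a `div` under `body` with an optional `head` and one or more `p` elements.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by Grobid's XML output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, hand-written TEI fragment standing in for real Grobid output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div><head>Introduction</head><p>Catalysts lower activation energy.</p></div>
      <div><head>Methods</head><p>Samples were dried.</p><p>Spectra were recorded.</p></div>
    </body>
  </text>
</TEI>"""


def extract_sections(tei_xml):
    """Return a list of (header, text) pairs, one per body section."""
    root = ET.fromstring(tei_xml)
    sections = []
    for div in root.findall(".//tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        # A section may lack a header; fall back to an empty string.
        header = "".join(head.itertext()).strip() if head is not None else ""
        paragraphs = [
            "".join(p.itertext()).strip() for p in div.findall("tei:p", TEI_NS)
        ]
        sections.append((header, " ".join(paragraphs)))
    return sections


parsed_sections = extract_sections(SAMPLE_TEI)
for header, text in parsed_sections:
    print(f"{header}: {text}")
```

Real Grobid output contains more structure (nested references, figures, formulas), so a production parser would need to handle those cases as well.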

The table below shows the journals from which Open Access articles were sourced, as well as the number of papers processed.

For sources that also host papers from other disciplines (e.g. PubMed), we kept only papers whose provided topic is Chemistry.

| Source | # of Articles |
| --- | --- |
| Beilstein | 1,829 |
| Chem Cell | 546 |
| ChemRxiv | 12,231 |
| Chemistry Open | 398 |
| Nature Communications Chemistry | 572 |
| PubMed Author Manuscript | 57,680 |
| PubMed Open Access | 29,540 |
| Royal Society of Chemistry (RSC) | 9,334 |
| Scientific Reports - Nature | 6,826 |

Languages

English

Dataset Structure

Data Fields

| Column | Description |
| --- | --- |
| uuid | Unique identifier for the example |
| title | Title of the article |
| article_source | Open-Access source journal (see above for list) |
| abstract | Abstract (summary reference) |
| sections | Full-text sections from the main body of the paper (<!> indicates section boundaries) |
| headers | Corresponding section headers for the sections field (<!> delimited) |
| source_toks | Aggregate number of tokens across sections |
| target_toks | Number of tokens in the abstract |
| compression | Ratio of source_toks to target_toks |
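The `<!>` delimiter and the compression ratio can be illustrated with a minimal sketch. The `example` record below is invented for this illustration (it is not a real row from the dataset), the helper names are hypothetical, and the code assumes that `headers` and `sections` always contain the same number of `<!>`-separated parts.

```python
DELIMITER = "<!>"


def split_field(value):
    """Split a <!>-delimited field into a list of stripped strings."""
    return [part.strip() for part in value.split(DELIMITER)]


def pair_sections(example):
    """Pair each header with its section text."""
    headers = split_field(example["headers"])
    sections = split_field(example["sections"])
    if len(headers) != len(sections):
        # Assumption: the two fields are aligned one-to-one.
        raise ValueError("headers and sections have different lengths")
    return list(zip(headers, sections))


def compression_ratio(example):
    """Recompute compression as source_toks divided by target_toks."""
    return example["source_toks"] / example["target_toks"]


# Hypothetical record, written only to show the field layout.
example = {
    "headers": "Introduction<!>Results<!>Conclusion",
    "sections": "Intro text.<!>Results text.<!>Closing text.",
    "source_toks": 3000,
    "target_toks": 200,
}

pairs = pair_sections(example)
ratio = compression_ratio(example)
print(pairs)
print(ratio)
```

A higher ratio means the abstract condenses the full text more aggressively.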

Please refer to load_chemistry() in https://github.com/griff4692/calibrating-summaries/blob/master/preprocess/preprocess.py for pre-processing as a summarization dataset. The inputs are sections and headers, and the target is the abstract.
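For readers who want a feel for the input/target layout before reading load_chemistry(), here is a minimal sketch of one way to turn a record into a (source, target) pair. It does not reproduce load_chemistry(); the function name `build_summarization_pair`, the header-then-text layout, and the sample record are all assumptions made for illustration.

```python
DELIMITER = "<!>"


def build_summarization_pair(example):
    """Return (source, target) strings for one record.

    The source interleaves each header with its section text; the target
    is the abstract. This layout is an assumption, not the authors' format.
    """
    headers = [h.strip() for h in example["headers"].split(DELIMITER)]
    sections = [s.strip() for s in example["sections"].split(DELIMITER)]
    blocks = []
    for header, section in zip(headers, sections):
        # Put the header on its own line when one is present.
        blocks.append(f"{header}\n{section}" if header else section)
    source = "\n\n".join(blocks)
    target = example["abstract"].strip()
    return source, target


# Hypothetical record, written only to show the transformation.
record = {
    "headers": "Introduction<!>Results",
    "sections": "Text A.<!>Text B.",
    "abstract": "Summary.",
}

source, target = build_summarization_pair(record)
print(source)
print(target)
```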

Data Splits

| Split | Count |
| --- | --- |
| train | 115,956 |
| validation | 1,000 |
| test | 2,000 |
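As a quick consistency check, the split counts can be compared against the per-source article counts listed earlier in this card. The sketch below uses only numbers from the two tables; both totals come to 118,956.

```python
# Article counts per source, copied from the source table in this card.
source_counts = {
    "Beilstein": 1829,
    "Chem Cell": 546,
    "ChemRxiv": 12231,
    "Chemistry Open": 398,
    "Nature Communications Chemistry": 572,
    "PubMed Author Manuscript": 57680,
    "PubMed Open Access": 29540,
    "Royal Society of Chemistry (RSC)": 9334,
    "Scientific Reports - Nature": 6826,
}

# Example counts per split, copied from the splits table.
split_counts = {"train": 115956, "validation": 1000, "test": 2000}

total_sources = sum(source_counts.values())
total_splits = sum(split_counts.values())

# Share of the full dataset held by each split.
split_fractions = {
    name: count / total_splits for name, count in split_counts.items()
}

print(total_sources, total_splits)
for name, fraction in split_fractions.items():
    print(f"{name}: {fraction:.2%}")
```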

Citation Information

@article{adams2023desired,
  title={What are the Desired Characteristics of Calibration Sets? Identifying Correlates on Long Form Scientific Summarization},
  author={Adams, Griffin and Nguyen, Bichlien H and Smith, Jake and Xia, Yingce and Xie, Shufang and Ostropolets, Anna and Deb, Budhaditya and Chen, Yuan-Jyue and Naumann, Tristan and Elhadad, No{\'e}mie},
  journal={arXiv preprint arXiv:2305.07615},
  year={2023}
}