Dataset:

pszemraj/scientific_lay_summarisation-plos-norm


scientific_lay_summarisation - PLOS - normalized

This dataset is a modified version of tomasg25/scientific_lay_summarization and contains scientific lay summaries that have been preprocessed with this code. The preprocessing fixes punctuation and whitespace problems and calculates the token length of each text sample using a tokenizer from the T5 model.

Original dataset details:

Data Cleaning

The text in both the "article" and "summary" columns was processed to ensure that punctuation and whitespace were consistent. The fix_punct_whitespace function was applied to each text sample to:

  • Remove spaces before punctuation marks (except for parentheses)
  • Add a space after punctuation marks (except for parentheses) if missing
  • Handle spaces around parentheses
  • Add a space after a closing parenthesis if followed by a word or opening parenthesis
  • Handle spaces around quotation marks
  • Handle spaces around single quotes
  • Handle commas in numbers
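
The rules above can be approximated with a handful of regular expressions. The following is a minimal sketch only: it reuses the name fix_punct_whitespace for readability, but the patterns are assumptions written for illustration and are not the preprocessing code actually used for this dataset. It deliberately stays simple, so abbreviations such as "e.g." or URLs would be split by the "space after punctuation" rule.

```python
import re


def fix_punct_whitespace(text: str) -> str:
    """Illustrative cleanup pass; the patterns are assumptions, not the dataset's code."""
    # Commas in numbers: "1 , 000" -> "1,000". A space before the comma is
    # required so that ordinary prose like "in 2019, 300 patients" is untouched.
    text = re.sub(r"(\d)\s+,\s*(\d{3})\b", r"\1,\2", text)
    # Remove whitespace before punctuation marks (parentheses excluded).
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    # Add a space after punctuation when a letter follows immediately.
    text = re.sub(r"([.,;:!?])(?=[A-Za-z])", r"\1 ", text)
    # Parentheses: no space just inside "(" or ")".
    text = re.sub(r"\(\s+", "(", text)
    text = re.sub(r"\s+\)", ")", text)
    # Add a space after ")" when a word character or "(" follows.
    text = re.sub(r"\)(?=[\w(])", ") ", text)
    # Double quotes: strip whitespace just inside a quoted span.
    text = re.sub(r'"\s*([^"]*?)\s*"', r'"\1"', text)
    # Single quotes: rejoin common English contractions ("it ' s" -> "it's").
    text = re.sub(r"(\w)\s+'\s*(s|t|re|ve|ll|d|m)\b", r"\1'\2", text)
    return text
```

Applied to a pandas column, a function like this would typically be used as df["article"].apply(fix_punct_whitespace).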

Tokenization

The length of each text sample was calculated in terms of tokens using the T5 tokenizer. The calculate_token_length function was used to encode each text sample using the tokenizer and return the number of resulting tokens. The resulting token lengths were added as new columns to the dataframes.
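
As a rough illustration of the token-counting step, the sketch below encodes a string and returns the number of token IDs. It makes several assumptions that the card does not confirm: the checkpoint name "t5-base", the helper load_t5_tokenizer (hypothetical), and the fact that special tokens such as the end-of-sequence marker are counted. The actual calculate_token_length used for the dataset may differ.

```python
def load_t5_tokenizer(model_name: str = "t5-base"):
    """Hypothetical helper; "t5-base" is an assumed checkpoint, not confirmed by the card."""
    # Imported lazily so the counting function below has no hard dependency.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_name)


def calculate_token_length(text: str, tokenizer) -> int:
    """Return the number of tokens produced by tokenizer.encode for one sample."""
    # truncation=False keeps the full sequence so long articles are not clipped.
    # encode() adds special tokens by default, so they are included in the count.
    return len(tokenizer.encode(text, truncation=False))
```

With a dataframe loaded, the new columns could then be filled with something like df["article_length"] = df["article"].apply(lambda t: calculate_token_length(t, tokenizer)), where tokenizer comes from load_t5_tokenizer().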

Data Format

The resulting processed data files are stored in Apache Parquet format and can be loaded using the pandas library or the Hugging Face datasets library. The relevant column names and the number of rows in each split are:

DatasetDict({
    train: Dataset({
        features: ['article', 'summary', 'section_headings', 'keywords', 'year', 'title', 'article_length', 'summary_length'],
        num_rows: 24773
    })
    test: Dataset({
        features: ['article', 'summary', 'section_headings', 'keywords', 'year', 'title', 'article_length', 'summary_length'],
        num_rows: 1376
    })
    validation: Dataset({
        features: ['article', 'summary', 'section_headings', 'keywords', 'year', 'title', 'article_length', 'summary_length'],
        num_rows: 1376
    })
})

Usage

Load the desired parquet file(s) using pandas or datasets. Here is an example using pandas:

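# Download the dataset files first: click 'use in datasets' on the dataset page and clone the repository
import pandas as pd

# Load train set
df = pd.read_parquet("scientific_lay_summarisation-plos-norm/train.parquet")
# info() prints the summary itself and returns None, so it is not wrapped in print()
df.info()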

And here is an example using datasets:

from datasets import load_dataset

dataset = load_dataset("pszemraj/scientific_lay_summarisation-plos-norm")
train_set = dataset['train']
# Print the first few samples
for i in range(5):
    print(train_set[i])
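
The precomputed article_length and summary_length columns make it straightforward to drop samples that would not fit a model's context window. The predicate below is a minimal sketch: the function name fits_token_budget and both default limits are hypothetical values chosen for illustration, not recommendations from the dataset author.

```python
def fits_token_budget(example, max_article_tokens=16384, max_summary_tokens=1024):
    """Return True when both precomputed token lengths are within the given limits.

    The default limits are arbitrary illustration values (assumptions).
    """
    return (
        example["article_length"] <= max_article_tokens
        and example["summary_length"] <= max_summary_tokens
    )


# With the Hugging Face datasets library, the predicate can be passed to filter():
#   short_train = dataset["train"].filter(fits_token_budget)
# Custom limits can be supplied through fn_kwargs:
#   short_train = dataset["train"].filter(
#       fits_token_budget, fn_kwargs={"max_article_tokens": 4096}
#   )
```

Because the lengths were computed with a T5 tokenizer, they are only an approximation for models that use a different vocabulary.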

Token Lengths

For train split: