Dataset:
pszemraj/scientific_lay_summarisation-plos-norm
This dataset is a modified version of tomasg25/scientific_lay_summarization and contains scientific lay summaries that have been preprocessed with this code. The preprocessing fixes punctuation and whitespace problems and calculates the token length of each text sample using a tokenizer from the T5 model.
Preprocessing details:
The text in both the "article" and "summary" columns was processed to ensure consistent punctuation and whitespace. The fix_punct_whitespace function was applied to each text sample to normalize spacing around punctuation marks.
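The exact rules live in the preprocessing code linked above. As a rough illustration only (not the author's implementation), a minimal normalizer might look like the sketch below; every regex rule here is an assumption:

```python
import re

def fix_punct_whitespace(text: str) -> str:
    # Illustrative sketch only -- the actual function in the linked
    # preprocessing code may apply different rules.
    # Remove spaces that appear before punctuation marks
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    # Ensure a space follows punctuation, but not inside numbers like 3.14
    text = re.sub(r"([,.;:!?])(?=[^\s\d])", r"\1 ", text)
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s{2,}", " ", text).strip()

print(fix_punct_whitespace("Cells divide ,grow,and die .They are small."))
```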
The length of each text sample was calculated in tokens using the T5 tokenizer. The calculate_token_length function encodes each text sample with the tokenizer and returns the number of resulting tokens. These token lengths were added to the dataframes as new columns (article_length and summary_length).
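A minimal sketch of how such length columns can be computed with the transformers library. The card does not name the exact T5 checkpoint, so t5-base below is an assumption:

```python
import pandas as pd
from transformers import AutoTokenizer

# Assumption: checkpoint not specified on the card; t5-base used for illustration
tokenizer = AutoTokenizer.from_pretrained("t5-base")

def calculate_token_length(text: str) -> int:
    # Encode the text and count the resulting token IDs
    return len(tokenizer.encode(text, truncation=False))

df = pd.DataFrame({"article": ["Cells divide via mitosis."],
                   "summary": ["Cells split in two."]})
df["article_length"] = df["article"].apply(calculate_token_length)
df["summary_length"] = df["summary"].apply(calculate_token_length)
print(df[["article_length", "summary_length"]])
```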
The resulting processed data files are stored in Apache Parquet format and can be loaded with the pandas library or the Hugging Face datasets library. The relevant columns for summarization are article (the input text) and summary (the target text); the full dataset structure is:
```
DatasetDict({
    train: Dataset({
        features: ['article', 'summary', 'section_headings', 'keywords', 'year', 'title', 'article_length', 'summary_length'],
        num_rows: 24773
    })
    test: Dataset({
        features: ['article', 'summary', 'section_headings', 'keywords', 'year', 'title', 'article_length', 'summary_length'],
        num_rows: 1376
    })
    validation: Dataset({
        features: ['article', 'summary', 'section_headings', 'keywords', 'year', 'title', 'article_length', 'summary_length'],
        num_rows: 1376
    })
})
```
Load the desired Parquet file(s) using pandas or datasets. Here is an example using pandas:
```python
# Download the dataset files by clicking on "Use in datasets" and cloning
import pandas as pd

# Load the train split
df = pd.read_parquet("scientific_lay_summarisation-plos-norm/train.parquet")
print(df.info())
```
And here is an example using datasets:
```python
from datasets import load_dataset

dataset = load_dataset("pszemraj/scientific_lay_summarisation-plos-norm")
train_set = dataset['train']

# Print the first few samples
for i in range(5):
    print(train_set[i])
```
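The precomputed article_length and summary_length columns are convenient for filtering samples to a model's context budget. A small sketch, assuming a hypothetical 4096-token limit:

```python
from datasets import load_dataset

dataset = load_dataset("pszemraj/scientific_lay_summarisation-plos-norm")

# Hypothetical context budget; adjust to your model
MAX_TOKENS = 4096

# Keep only samples whose article fits the budget
filtered = dataset['train'].filter(lambda ex: ex["article_length"] <= MAX_TOKENS)
print(f"{len(filtered)} of {len(dataset['train'])} train samples kept")
```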
[Figure: token length distribution for the train split]