Model:
pszemraj/led-large-book-summary
This model is a fine-tuned version of allenai/led-large-16384 on the BookSum dataset (kmfoda/booksum). It aims to generalize well and be useful in summarizing lengthy text for both academic and everyday purposes.
Note: Due to inference API timeout constraints, outputs may be truncated before the full summary is returned (try Python or the demo).
To improve summary quality, use encoder_no_repeat_ngram_size=3 when calling the pipeline object. This setting encourages the model to utilize new vocabulary and construct an abstractive summary.
Load the model into a pipeline object:
```python
import torch
from transformers import pipeline

hf_name = 'pszemraj/led-large-book-summary'

summarizer = pipeline(
    "summarization",
    hf_name,
    device=0 if torch.cuda.is_available() else -1,
)
```
Feed the text into the pipeline object:
```python
wall_of_text = "your words here"

result = summarizer(
    wall_of_text,
    min_length=16,
    max_length=256,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=3,
    repetition_penalty=3.5,
    num_beams=4,
    early_stopping=True,
)
```
Important: For optimal summary quality, use the global attention mask when decoding, as demonstrated in this community notebook; see the definition of generate_answer(batch).
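As a rough sketch of what that looks like (a hypothetical helper, assuming the standard LED convention of placing global attention on the first token; the notebook's generate_answer(batch) is the authoritative reference):

```python
import torch

def make_global_attention_mask(input_ids: torch.Tensor) -> torch.Tensor:
    """Build a global_attention_mask for LED: same shape as input_ids,
    with 1 at positions that should attend globally (here, only the
    first token of each sequence)."""
    mask = torch.zeros_like(input_ids)
    mask[:, 0] = 1  # global attention on the first (<s>) token
    return mask
```

The resulting mask would then be passed to the model's generate call alongside the input ids when decoding.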
If you're facing computing constraints, consider using the base version pszemraj/led-base-book-summary .
The model was fine-tuned on the booksum dataset. During training, the chapter column was the input and the summary_text column was the output.
Fine-tuning was run on the BookSum dataset across 13+ epochs. Notably, the final four epochs combined the training and validation sets as 'train' to enhance generalization.
The training process involved different settings across stages.
To streamline the process of using this and other models, I've developed a Python package utility named textsum . This package offers simple interfaces for applying summarization models to text documents of arbitrary length.
Install TextSum:
```bash
pip install textsum
```
Then use it in Python with this model:
```python
from textsum.summarize import Summarizer

model_name = "pszemraj/led-large-book-summary"
summarizer = Summarizer(
    model_name_or_path=model_name,  # you can use any Seq2Seq model on the Hub
    token_batch_length=4096,  # tokens to batch summarize at a time, up to 16384
)

long_string = "This is a long string of text that will be summarized."
out_str = summarizer.summarize_string(long_string)
print(f"summary: {out_str}")
```
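Conceptually, the token batching controlled by token_batch_length can be sketched as follows (a simplified, hypothetical helper, not textsum's actual implementation):

```python
def chunk_token_ids(token_ids, batch_length=4096):
    """Split a list of token ids into consecutive batches of at most
    batch_length tokens; each batch is summarized independently and the
    partial summaries are then joined."""
    return [
        token_ids[i:i + batch_length]
        for i in range(0, len(token_ids), batch_length)
    ]
```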
Currently implemented interfaces include a Python API, a Command-Line Interface (CLI), and a demo/web UI.
For detailed explanations and documentation, check the README or the wiki.
Check out these other related models, also trained on the BookSum dataset:
Other variants trained on other datasets are also available on my Hugging Face profile; feel free to try them out :)