
Model Card for deltalm-base-xlsum

This model is a fine-tuned version of DeltaLM-base on the XLSum dataset, intended for abstractive multilingual summarization.

It achieves the following results on the evaluation set:

  • rouge-1: 18.2
  • rouge-2: 7.6
  • rouge-l: 14.9
  • rouge-lsum: 14.7
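For readers unfamiliar with the metric, the following is a minimal sketch of how a ROUGE-N F1 score can be computed from n-gram overlap. It is not the evaluation script used for the numbers above: the function name `rouge_n_f1` is hypothetical, and the lowercase whitespace tokenization is an assumption that does not suit languages written without spaces (for example Chinese, Japanese, or Thai), where a language-aware tokenizer would be needed.

```python
from collections import Counter


def rouge_n_f1(reference: str, candidate: str, n: int = 1) -> float:
    """Return the ROUGE-N F1 score (0.0 to 1.0) between two texts.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    # Clipped overlap: each n-gram counts at most as often as it appears in both texts.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Scores in this card are reported on a 0-100 scale, hence the factor of 100.
    print(100 * rouge_n_f1("the nfl extended its deal with amazon",
                           "nfl extended its amazon deal"))
```

ROUGE-L and ROUGE-Lsum are based on the longest common subsequence rather than fixed-size n-grams, so they are not covered by this sketch.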

Dataset description

The XLSum dataset is a comprehensive and diverse dataset comprising 1.35 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 45 languages ranging from low- to high-resource, for many of which no public dataset is currently available. XLSum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.

Languages

  • amharic
  • arabic
  • azerbaijani
  • bengali
  • burmese
  • chinese_simplified
  • chinese_traditional
  • english
  • french
  • gujarati
  • hausa
  • hindi
  • igbo
  • indonesian
  • japanese
  • kirundi
  • korean
  • kyrgyz
  • marathi
  • nepali
  • oromo
  • pashto
  • persian
  • pidgin
  • portuguese
  • punjabi
  • russian
  • scottish_gaelic
  • serbian_cyrillic
  • serbian_latin
  • sinhala
  • somali
  • spanish
  • swahili
  • tamil
  • telugu
  • thai
  • tigrinya
  • turkish
  • ukrainian
  • urdu
  • uzbek
  • vietnamese
  • welsh
  • yoruba

Training hyperparameters

The model was trained on a p4d.24xlarge instance on AWS SageMaker with the following configuration:

  • model: deltalm base
  • batch size: 8
  • learning rate: 1e-5
  • number of epochs: 3
  • warmup steps: 500
  • weight decay: 0.01
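To illustrate how the learning rate and warmup steps interact, here is a minimal sketch of a learning-rate schedule with linear warmup. The card does not state which scheduler was used, so the linear decay after warmup is an assumption, as are the function name `linear_warmup_lr` and the `total_steps` value in the usage example.

```python
def linear_warmup_lr(step: int, total_steps: int,
                     peak_lr: float = 1e-5, warmup_steps: int = 500) -> float:
    """Learning rate at a given optimizer step.

    The rate rises linearly from 0 to peak_lr over warmup_steps, then
    (assumption, not stated in the card) decays linearly to 0 at total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


if __name__ == "__main__":
    # total_steps is hypothetical; it depends on dataset size, batch size and epochs.
    for step in (0, 250, 500, 5000, 10000):
        print(step, linear_warmup_lr(step, total_steps=10000))
```

With these settings the rate reaches its peak of 1e-5 at step 500, so the first 500 updates are smaller than the rest of training.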

Inference example

from modeling_deltalm import DeltalmForConditionalGeneration  # download from https://huggingface.co/hhhhzy/deltalm-base-xlsum/blob/main/modeling_deltalm.py
from configuration_deltalm import DeltalmConfig  # download from https://huggingface.co/hhhhzy/deltalm-base-xlsum/blob/main/configuration_deltalm.py
from transformers import AutoTokenizer

model = DeltalmForConditionalGeneration.from_pretrained("hhhhzy/deltalm-base-xlsum")
tokenizer = AutoTokenizer.from_pretrained("hhhhzy/deltalm-base-xlsum")

text = "The USA’s biggest sports league, the NFL, has extended its partnership with Amazon Prime, granting the streaming platform an additional live game on ‘black Friday’, the day after Thanksgiving. The additional game, added from 2023, builds on Amazon Prime’s package of ‘Thursday night football’ live rights (secured in an 11-year deal).\\nOn the surface, the deal makes sense because it gives Amazon Prime additional game time during the holiday season. But there is a deeper motivation at play. Black Friday is also regarded as the starting point of the pre-Christmas shopping season. Amazon has worked hard to leverage its sports rights in a way that benefits its ecommerce platform, so the addition of this fixture will boost that strategic goal.\\nIt’s unusual for sports rights holders to utilise their inventory in such a granular way – but it does suggest a shift towards a more data-driven approach to negotiations. For NFL, the deal means it now has partnerships with NBC, CBS, Fox and Amazon across the Thanksgiving period. Amazon Prime is currently in the NFL’s good books, helping revitalise the Thursday night slot through its marketing support and onscreen investment. Around 10 million people in the US are watching live fixtures each week."
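# Truncate inputs longer than 512 tokens (truncation is requested explicitly).
inputs = tokenizer(text, max_length=512, truncation=True, return_tensors="pt")

# Generate a summary between 32 and 128 tokens long, then decode it to text.
generate_ids = model.generate(inputs["input_ids"], min_length=32, max_length=128)
summary = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(summary)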