Spanish RoBERTa2RoBERTa (roberta-base-bne) fine-tuned on MLSUM ES for summarization

Model

BSC-TeMU/roberta-base-bne (RoBERTa checkpoint)

Dataset

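MLSUM is the first large-scale multilingual summarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages: French, German, Spanish, Russian, and Turkish. Together with English newspapers from the popular CNN/Daily mail dataset, the collected data form a large-scale multilingual dataset that can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases, which motivate the use of a multilingual dataset.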

MLSUM es

Results

Set	Metric	Value
Test	Rouge2 - mid - precision	11.42
Test	Rouge2 - mid - recall	10.58
Test	Rouge2 - mid - fmeasure	10.69
Test	Rouge1 - high - fmeasure	28.83
Test	RougeL - high - fmeasure	23.15
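To make the Rouge2 rows easier to read, the following is a minimal sketch of what a ROUGE-2 score measures: clipped bigram overlap between a prediction and a reference, reported as precision, recall, and F-measure. It is a simplification, not the implementation behind the numbers above: it assumes lowercase whitespace tokenization and skips the normalization and bootstrap aggregation (low/mid/high) that the HF metric applies. The names `bigrams` and `rouge2` are hypothetical.

```python
from collections import Counter


def bigrams(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


def rouge2(prediction, reference):
    """Simplified ROUGE-2: clipped bigram overlap (assumes whitespace tokens)."""
    pred = bigrams(prediction.lower().split())
    ref = bigrams(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared bigram.
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return precision, recall, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure


p, r, f = rouge2("el gato come carne", "el gato come pescado")
print(p, r, f)
```

In this toy pair, two of the three bigrams on each side match, so precision, recall, and F-measure all come out at two thirds.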

Raw metrics computed with the HF/metrics rouge metric:

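import datasets
# Note: load_metric was deprecated in later datasets releases in favor of the evaluate library
rouge = datasets.load_metric("rouge")
rouge.compute(predictions=results["pred_summary"], references=results["summary"])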

{'rouge1': AggregateScore(low=Score(precision=0.30393366820245, recall=0.27905239591639935, fmeasure=0.283148902808752), mid=Score(precision=0.3068521142101569, recall=0.2817252494122592, fmeasure=0.28560373425206464), high=Score(precision=0.30972608774202665, recall=0.28458152325781716, fmeasure=0.2883786700591887)),
 'rougeL': AggregateScore(low=Score(precision=0.24184668819794716, recall=0.22401171380621518, fmeasure=0.22624104698839514), mid=Score(precision=0.24470388406868163, recall=0.22665793214539162, fmeasure=0.2289118878817394), high=Score(precision=0.2476594458951327, recall=0.22932683203591905, fmeasure=0.23153001570662513))}
 
rouge.compute(predictions=results["pred_summary"], references=results["summary"], rouge_types=["rouge2"])["rouge2"].mid

Score(precision=0.11423200347113865, recall=0.10588038944902506, fmeasure=0.1069921217219595)
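The Rouge2 rows in the results table can be recovered from this raw `Score` by scaling to percentages. The sketch below assumes the table values were truncated (not rounded) to two decimals, which is what the reported 10.58 and 10.69 suggest. `Score` here is a local stand-in namedtuple with the same field names as the raw output, and `to_percent` is a hypothetical helper.

```python
import math
from collections import namedtuple

# Local stand-in for the Score tuple shown in the raw output (assumption).
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])


def to_percent(value):
    """Scale a 0-1 score to a percentage, truncated to two decimals."""
    return math.floor(value * 10000) / 100


rouge2_mid = Score(
    precision=0.11423200347113865,
    recall=0.10588038944902506,
    fmeasure=0.1069921217219595,
)
row = {name: to_percent(value) for name, value in rouge2_mid._asdict().items()}
print(row)  # {'precision': 11.42, 'recall': 10.58, 'fmeasure': 10.69}
```

Rounding instead of truncating would give 10.59 and 10.7 for recall and F-measure, so the choice matters when comparing against the table.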

Usage

import torch
from transformers import RobertaTokenizerFast, EncoderDecoderModel

# Use the GPU when one is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ckpt = 'Narrativa/bsc_roberta2roberta_shared-spanish-finetuned-mlsum-summarization'
tokenizer = RobertaTokenizerFast.from_pretrained(ckpt)
model = EncoderDecoderModel.from_pretrained(ckpt).to(device)

def generate_summary(text):
    # Tokenize a single article, padding/truncating it to 512 tokens
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    # Generate the summary token ids and decode them back to text
    output = model.generate(input_ids, attention_mask=attention_mask)
    return tokenizer.decode(output[0], skip_special_tokens=True)

text = "Your text here..."
print(generate_summary(text))
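When there are many articles to summarize, calling `generate_summary` once per text can be slow. The following is a minimal sketch of a batched variant, not part of the original card. The name `generate_summaries` and the `batch_size` parameter are hypothetical, and the function takes the `tokenizer`, `model`, and `device` from the snippet above as arguments. It assumes padding to the longest article in each batch (`padding=True`) is acceptable in place of `padding="max_length"`.

```python
def generate_summaries(texts, tokenizer, model, device="cpu", batch_size=8):
    """Summarize a list of texts in fixed-size batches (hypothetical helper)."""
    summaries = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Pad to the longest article in the batch; truncate at 512 tokens.
        inputs = tokenizer(
            batch,
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors="pt",
        )
        output = model.generate(
            inputs.input_ids.to(device),
            attention_mask=inputs.attention_mask.to(device),
        )
        # batch_decode returns one string per generated sequence.
        summaries.extend(tokenizer.batch_decode(output, skip_special_tokens=True))
    return summaries
```

With the objects loaded above, a call would look like `generate_summaries(["Texto uno...", "Texto dos..."], tokenizer, model, device)`. A smaller `batch_size` trades speed for lower memory use.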

Created by: Narrativa

About Narrativa: Natural Language Generation (NLG) | Gabriele, our machine learning-based platform, builds and deploys natural language solutions. #NLG #AI

Author:

Narrativa

Dataset size:

589.02 MB