German BERT2BERT fine-tuned on MLSUM DE for summarization

Model

bert-base-german-cased (BERT checkpoint)

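Dataset

MLSUM is the first large-scale multilingual summarization dataset. Obtained from online newspapers, it contains more than 1.5 million article/summary pairs in five languages: French, German, Spanish, Russian, and Turkish. Together with the English newspapers from the well-known CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset that can open new research directions for text summarization. The dataset's authors report cross-lingual comparative analyses based on state-of-the-art systems. These analyses highlight existing biases, which in turn motivate the use of a multilingual dataset.

Subset used here: MLSUM de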

Results

Set	Metric	Score
Test	ROUGE-2 (mid, precision)	33.04
Test	ROUGE-2 (mid, recall)	33.83
Test	ROUGE-2 (mid, F-measure)	33.15
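
To make the three columns concrete, here is a minimal sketch of how ROUGE-2 precision, recall, and F-measure can be computed for a single reference/candidate pair. It is an illustration only, not the evaluation code behind the scores above. The function names `bigrams` and `rouge2` are hypothetical, and it assumes lowercased whitespace tokenization with no stemming. The "mid" label in the table most likely refers to the middle value of a bootstrap aggregate over the test set, as reported by common ROUGE tooling; this sketch does not reproduce that step.

```python
from collections import Counter


def bigrams(tokens):
    """Count the adjacent token pairs (bigrams) in a token list."""
    return Counter(zip(tokens, tokens[1:]))


def rouge2(reference, candidate):
    """Return (precision, recall, f_measure) of bigram overlap.

    Assumption: simple lowercased whitespace tokenization, which is
    cruder than what dedicated ROUGE packages do.
    """
    ref = bigrams(reference.lower().split())
    cand = bigrams(candidate.lower().split())
    # Clipped overlap: each bigram counts at most as often as it
    # appears in both the reference and the candidate.
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


p, r, f = rouge2("die Katze sitzt auf der Matte",
                 "die Katze liegt auf der Matte")
print(f"precision={p:.2f} recall={r:.2f} f={f:.2f}")
```

The values returned are fractions between 0 and 1. The scores in the table appear to be on a 0 to 100 scale, so multiply by 100 to compare.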

Usage

import torch
from transformers import BertTokenizerFast, EncoderDecoderModel

# Use a GPU when one is available, otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ckpt = 'mrm8488/bert2bert_shared-german-finetuned-summarization'
tokenizer = BertTokenizerFast.from_pretrained(ckpt)
model = EncoderDecoderModel.from_pretrained(ckpt).to(device)

def generate_summary(text):
    # Pad or truncate the input to 512 tokens, the maximum BERT accepts.
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    # Generate with the checkpoint's default generation settings.
    output = model.generate(input_ids, attention_mask=attention_mask)
    # Decode the first (and only) sequence, dropping [CLS], [SEP], and [PAD].
    return tokenizer.decode(output[0], skip_special_tokens=True)

text = "Your text here..."

print(generate_summary(text))

Created by Manuel Romero/@mrm8488 with the support of Narrativa. Made with ♥ in Spain.

Author:

Manuel Romero

Dataset size:

1.03 GB