Model: ctu-aic/mbart25-multilingual-summarization-multilarge-cs
Task: text2text generation
Datasets: Multilingual_large_dataset_(multilarge), cnc/dm, xsum, mlsum, cnewsum, cnc, sumeczech
Language: cs
Other: mbart, abstractive summarization, mbart-large-cc25, Czech, text2text generation, text generation, AutoTrain Compatible
License: cc-by-sa-4.0

This model is a checkpoint of facebook/mbart-large-cc25 fine-tuned on the Multilingual large summarization dataset. It focuses primarily on Czech texts and produces multilingual summaries.
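The model handles multi-sentence summarization in eight different languages. By adding documents in other languages alongside a considerable number of Czech documents, we aimed to improve summarization quality for Czech. Supported languages: 'en_XX': 'en', 'de_DE': 'de', 'es_XX': 'es', 'fr_XX': 'fr', 'ru_RU': 'ru', 'tr_TR': 'tr'.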
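The supported-language list reads as a lookup from mBART language codes to the short codes used in the configuration. A minimal sketch of that lookup and its inverse follows. The names `MBART_TO_SHORT`, `SHORT_TO_MBART` and `to_mbart_code` are hypothetical and not taken from the repository, and only the six pairs listed above are included.

```python
# Pairs copied from the "Supported languages" list; nothing else is assumed.
MBART_TO_SHORT = {
    "en_XX": "en",
    "de_DE": "de",
    "es_XX": "es",
    "fr_XX": "fr",
    "ru_RU": "ru",
    "tr_TR": "tr",
}

# Inverse lookup: short code -> mBART language code.
SHORT_TO_MBART = {short: code for code, short in MBART_TO_SHORT.items()}


def to_mbart_code(language: str) -> str:
    """Return the mBART code for a short language code (hypothetical helper)."""
    try:
        return SHORT_TO_MBART[language]
    except KeyError:
        raise ValueError(f"language {language!r} is not in the listed mapping") from None


print(to_mbart_code("de"))
```

Raising on unknown codes, rather than falling back to a default, keeps a mistyped language from silently producing a summary in the wrong language.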
The example below assumes that you are using the provided MultilingualSummarizer.ipynb file and the files included in the git repository.
```python
from collections import OrderedDict

# MultiSummarizer is defined in the repository files used by
# MultilingualSummarizer.ipynb; it must be available before running this.

## Configuration of summarization pipeline
#
def summ_config():
    cfg = OrderedDict([
        ## summarization model - checkpoint
        #   ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs
        #   ctu-aic/mt5-base-multilingual-summarization-multilarge-cs
        #   ctu-aic/mbart25-multilingual-summarization-multilarge-cs
        ("model_name", "ctu-aic/mbart25-multilingual-summarization-multilarge-cs"),

        ## language of summarization task
        #   language : string : cs, en, de, fr, es, tr, ru, zh
        ("language", "en"),

        ## generation method parameters in dictionary
        #
        ("inference_cfg", OrderedDict([
            ("num_beams", 4),
            ("top_k", 40),
            ("top_p", 0.92),
            ("do_sample", True),
            ("temperature", 0.95),
            ("repetition_penalty", 1.23),
            ("no_repeat_ngram_size", None),
            ("early_stopping", True),
            ("max_length", 128),
            ("min_length", 10),
        ])),

        # texts to summarize, values = (list of strings, string, dataset)
        ("texts", [
            "english text1 to summarize",
            "english text2 to summarize",
        ]),

        # OPTIONAL: target summaries, values = (list of strings, string, None)
        ("golds", [
            "target english text1",
            "target english text2",
        ]),
        # ("golds", None),
    ])
    return cfg


cfg = summ_config()
msummarizer = MultiSummarizer(**cfg)
ret = msummarizer(**cfg)
```
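The `inference_cfg` entries use the same names as the keyword arguments of the `generate()` method in Hugging Face `transformers`. Whether `MultiSummarizer` forwards them unchanged is an assumption, since its source is not shown here. Under that assumption, the sketch below shows one way to turn the dictionary into keyword arguments while dropping entries left as `None`; `build_generate_kwargs` is a hypothetical helper, not part of the repository.

```python
from collections import OrderedDict


def build_generate_kwargs(inference_cfg):
    """Keep only options with an explicit value (hypothetical helper)."""
    return {key: value for key, value in inference_cfg.items() if value is not None}


# Values copied from the configuration above.
inference_cfg = OrderedDict([
    ("num_beams", 4),
    ("top_k", 40),
    ("top_p", 0.92),
    ("do_sample", True),
    ("temperature", 0.95),
    ("repetition_penalty", 1.23),
    ("no_repeat_ngram_size", None),  # unset, so it is dropped below
    ("early_stopping", True),
    ("max_length", 128),
    ("min_length", 10),
])

generate_kwargs = build_generate_kwargs(inference_cfg)
print(sorted(generate_kwargs))
```

With `do_sample` set to `True` and `num_beams` set to 4, these settings would correspond to beam-search multinomial sampling in `transformers`, so repeated runs on the same text can return different summaries.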
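The Multilingual large summarization dataset consists of 10 sub-datasets, mainly based on news and daily mails. For training, the entire training set and 72% of the validation set were used.

Train set: 3 464 563 docs
Validation set: 121 260 docs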
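Statistics of the sub-datasets: fragment statistics, average document length, average summary length (in sentences and words), and number of documents.

dataset | fragment compression | fragment density | fragment coverage | avg document nsent | avg document nwords | avg summary nsent | avg summary nwords | document count |
---|---|---|---|---|---|---|---|---|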
cnc | 7.388 | 0.303 | 0.088 | 16.121 | 316.912 | 3.272 | 46.805 | 750K |
sumeczech | 11.769 | 0.471 | 0.115 | 27.857 | 415.711 | 2.765 | 38.644 | 1M |
cnndm | 13.688 | 2.983 | 0.538 | 32.783 | 676.026 | 4.134 | 54.036 | 300K |
xsum | 18.378 | 0.479 | 0.194 | 18.607 | 369.134 | 1.000 | 21.127 | 225K |
mlsum/tu | 8.666 | 5.418 | 0.461 | 14.271 | 214.496 | 1.793 | 25.675 | 274K |
mlsum/de | 24.741 | 8.235 | 0.469 | 32.544 | 539.653 | 1.951 | 23.077 | 243K |
mlsum/fr | 24.388 | 2.688 | 0.424 | 24.533 | 612.080 | 1.320 | 26.93 | 425K |
mlsum/es | 36.185 | 3.705 | 0.510 | 31.914 | 746.927 | 1.142 | 21.671 | 291K |
mlsum/ru | 78.909 | 1.194 | 0.246 | 62.141 | 948.079 | 1.012 | 11.976 | 27K |
cnewsum | 20.183 | 0.000 | 0.000 | 16.834 | 438.271 | 1.109 | 21.926 | 304K |
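The column names compression, density and coverage match the extractive fragment statistics introduced with the Newsroom dataset (Grusky et al.). Assuming the table follows those definitions, which the card does not state, the sketch below computes them for one tokenized document and summary pair. The function names and the whitespace tokenization are illustrative choices, not the authors' code.

```python
def fragment_lengths(article, summary):
    """Greedily match the longest shared token runs, scanning the summary left to right."""
    lengths = []
    i = 0
    while i < len(summary):
        best = 0
        j = 0
        while j < len(article):
            if summary[i] == article[j]:
                # Extend the match as far as both sequences agree.
                k = 0
                while (i + k < len(summary) and j + k < len(article)
                       and summary[i + k] == article[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            lengths.append(best)
        i += max(best, 1)  # skip the matched run, or one unmatched token
    return lengths


def fragment_stats(article, summary):
    """Return (compression, density, coverage) for one document/summary pair."""
    lengths = fragment_lengths(article, summary)
    n = len(summary)
    compression = len(article) / n
    density = sum(length ** 2 for length in lengths) / n
    coverage = sum(lengths) / n
    return compression, density, coverage


article = "the cat sat on the mat".split()
summary = "the cat on the mat".split()
print(fragment_stats(article, summary))
```

Under these definitions, low density and coverage indicate abstractive summaries that reuse few spans of the source, while high compression indicates summaries that are short relative to the document.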
Truncation and padding were set to 512 tokens for the encoder (input text) and 128 tokens for the decoder (summary).
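As an illustration of what fixed-length truncation and padding does to a token sequence, here is a minimal sketch on plain lists of token ids. `truncate_and_pad` and the `PAD_ID` value are hypothetical; in practice the tokenizer performs this step and supplies its own padding id.

```python
ENCODER_MAX_LENGTH = 512  # input text, as stated above
DECODER_MAX_LENGTH = 128  # summary, as stated above
PAD_ID = 1  # assumption: placeholder value, read it from the tokenizer in real code


def truncate_and_pad(token_ids, max_length, pad_id):
    """Clip to max_length, then right-pad; also return the attention mask."""
    clipped = list(token_ids[:max_length])
    n_pad = max_length - len(clipped)
    padded = clipped + [pad_id] * n_pad
    attention_mask = [1] * len(clipped) + [0] * n_pad
    return padded, attention_mask


long_ids, long_mask = truncate_and_pad(list(range(600)), ENCODER_MAX_LENGTH, PAD_ID)
short_ids, short_mask = truncate_and_pad([5, 6, 7], DECODER_MAX_LENGTH, PAD_ID)
print(len(long_ids), sum(long_mask), len(short_ids), sum(short_mask))
```

A real tokenizer also keeps its special tokens in place when truncating, which this sketch does not model.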
Training used the cross-entropy loss.
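For reference, token-level cross-entropy averages the negative log-probability of each target token under the model's softmax output. The sketch below is a plain-Python illustration, not the training code. Skipping targets equal to `-100` follows the default `ignore_index` of PyTorch's `CrossEntropyLoss`, and it is an assumption that padded label positions were masked this way during training.

```python
import math


def token_cross_entropy(logits, targets, ignore_id=-100):
    """Mean negative log-likelihood over target positions that are not ignored."""
    total = 0.0
    count = 0
    for row, target in zip(logits, targets):
        if target == ignore_id:
            continue  # e.g. padding positions in the summary
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[target]
        count += 1
    return total / count if count else 0.0


# Two scored positions over a toy 4-token vocabulary, plus one ignored position.
logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [9.0, 0.0, 0.0, 0.0]]
targets = [2, 0, -100]
print(token_cross_entropy(logits, targets))
```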
Time: 3 days 8 hours
Epochs: 860K steps, approx. 8 (of 10)
GPUs: 4x NVIDIA A100-SXM4-40GB
eloss: 2.214 - 1.762
tloss: 3.365 - 1.445
ROUGE scores per sub-dataset:

dataset | ROUGE-1 Precision | ROUGE-1 Recall | ROUGE-1 Fscore | ROUGE-2 Precision | ROUGE-2 Recall | ROUGE-2 Fscore | ROUGE-L Precision | ROUGE-L Recall | ROUGE-L Fscore |
---|---|---|---|---|---|---|---|---|---|
cnc | 27.45 | 24.8 | 25.24 | 9.35 | 8.54 | 8.67 | 20.14 | 18.19 | 18.54 |
sumeczech | 25.38 | 21.61 | 22.66 | 7.71 | 6.67 | 6.96 | 18.76 | 16.02 | 16.78 |
cnndm | 41.97 | 42.61 | 41.05 | 19.64 | 19.88 | 19.16 | 29.38 | 29.85 | 28.73 |
xsum | 39.18 | 39.8 | 38.83 | 16.59 | 16.98 | 16.5 | 31.25 | 31.74 | 30.96 |
mlsum-tu | 51.02 | 47.95 | 47.72 | 36.15 | 34.07 | 33.9 | 44.59 | 41.9 | 41.74 |
mlsum-de | 46.96 | 46.16 | 46.02 | 35.95 | 35.87 | 35.66 | 43.26 | 42.7 | 42.53 |
mlsum-fr | 34.51 | 31.4 | 32.03 | 16.56 | 15.07 | 15.37 | 26.73 | 24.41 | 24.86 |
mlsum-es | 32.62 | 29.66 | 30.21 | 13.3 | 12.2 | 12.39 | 26.24 | 24.02 | 24.4 |
mlsum-ru | 1.25 | 1.54 | 1.31 | 0.46 | 0.46 | 0.44 | 1.25 | 1.54 | 1.31 |
cnewsum | 26.43 | 29.44 | 26.38 | 7.38 | 8.52 | 7.46 | 25.99 | 28.94 | 25.92 |
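To make the precision, recall and F-score columns concrete, the sketch below computes ROUGE-N from clipped n-gram overlap between a candidate and a reference summary. It is a simplified illustration with whitespace tokenization and returns fractions rather than the percentages shown in the table. The card does not say which ROUGE implementation or preprocessing produced the reported numbers, so results from this sketch are not directly comparable to them.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, fscore) from clipped n-gram overlap."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # counts clipped to the smaller side
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


candidate = "the cat on the mat".split()
reference = "the cat sat on the mat".split()
print(rouge_n(candidate, reference, n=1))
print(rouge_n(candidate, reference, n=2))
```

ROUGE-L, the third group of columns, is based on the longest common subsequence instead of fixed-length n-grams and is not covered by this sketch.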