mt5-base-multilingual-summarization-multilarge-cs

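This model is a checkpoint of google/mt5-base fine-tuned on the Multilingual large summarization dataset, which focuses on Czech texts, to produce multilingual summaries.

Task

The model handles multi-sentence summarization in eight languages. By adding documents in other languages alongside a large amount of Czech documents, we aimed to improve summarization in Czech.

Supported languages: 'cs': '<extra_id_0>', 'en': '<extra_id_1>', 'de': '<extra_id_2>', 'es': '<extra_id_3>', 'fr': '<extra_id_4>', 'ru': '<extra_id_5>', 'tu': '<extra_id_6>', 'zh': '<extra_id_7>'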
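
The mapping above pairs each language code with an `<extra_id_N>` token. A minimal sketch of a lookup helper follows. `LANG_TOKENS` and `lang_token` are hypothetical names introduced for illustration, and how the checkpoint consumes the token (for example, where it is placed in the input) is not stated here, so the sketch only performs the lookup.

```python
# Language code -> token mapping, copied from the list of supported languages.
LANG_TOKENS = {
    "cs": "<extra_id_0>",
    "en": "<extra_id_1>",
    "de": "<extra_id_2>",
    "es": "<extra_id_3>",
    "fr": "<extra_id_4>",
    "ru": "<extra_id_5>",
    "tu": "<extra_id_6>",
    "zh": "<extra_id_7>",
}


def lang_token(language: str) -> str:
    """Return the token listed for a supported language code (hypothetical helper)."""
    try:
        return LANG_TOKENS[language]
    except KeyError:
        raise ValueError(
            f"unsupported language {language!r}; expected one of {sorted(LANG_TOKENS)}"
        ) from None


print(lang_token("cs"))
```

Failing early on an unknown code keeps a typo in the language setting from silently producing summaries in the wrong language.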

Usage

## Configuration of summarization pipeline
#
from collections import OrderedDict

def summ_config():
    cfg = OrderedDict([
        
        ## summarization model - checkpoint
        #   ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs
        #   ctu-aic/mt5-base-multilingual-summarization-multilarge-cs
        #   ctu-aic/mbart25-multilingual-summarization-multilarge-cs
        ("model_name", "ctu-aic/mt5-base-multilingual-summarization-multilarge-cs"),
        
        ## language of summarization task
        #   language : string : cs, en, de, fr, es, tr, ru, zh
        ("language", "en"), 
        
        ## generation method parameters in dictionary
        #
        ("inference_cfg", OrderedDict([
            ("num_beams", 4),
            ("top_k", 40),
            ("top_p", 0.92),
            ("do_sample", True),
            ("temperature", 0.95),
            ("repetition_penalty", 1.23),
            ("no_repeat_ngram_size", None),
            ("early_stopping", True),
            ("max_length", 128),
            ("min_length", 10),
        ])),
        # texts to summarize; accepted values: list of strings, string, or dataset
        ("texts",
            [
               "english text1 to summarize",
               "english text2 to summarize",
            ]
        ),
        # OPTIONAL: target (gold) summaries; accepted values: list of strings, string, or None
        ('golds',
         [
               "target english text1",
               "target english text2",
         ]),
        #('golds', None),
    ])
    return cfg

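# MultiSummarizer is not defined in this snippet; it has to be imported
# from the authors' code (the import is not shown in the source).
cfg = summ_config()
mSummarize = MultiSummarizer(**cfg)
summaries, scores = mSummarize(**cfg)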
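
The configuration comments say that `texts` may be a single string or a list of strings, and that `golds` is optional. A small sketch of how such inputs could be normalized before use follows. `as_text_list` and `check_inputs` are hypothetical helpers, not part of `MultiSummarizer`, whose internal handling is not shown here. The dataset case mentioned in the comments is deliberately left out.

```python
from typing import List, Optional, Sequence, Tuple, Union

TextInput = Union[str, Sequence[str]]


def as_text_list(value: TextInput) -> List[str]:
    """Wrap a single string in a list; copy any other sequence into a list."""
    if isinstance(value, str):
        return [value]
    return list(value)


def check_inputs(
    texts: TextInput, golds: Optional[TextInput] = None
) -> Tuple[List[str], Optional[List[str]]]:
    """Normalize texts and optional gold summaries (hypothetical helper)."""
    text_list = as_text_list(texts)
    if golds is None:
        return text_list, None
    gold_list = as_text_list(golds)
    # Each gold summary is assumed to pair with the text at the same index.
    if len(gold_list) != len(text_list):
        raise ValueError(
            f"got {len(text_list)} texts but {len(gold_list)} gold summaries"
        )
    return text_list, gold_list


texts, golds = check_inputs("english text1 to summarize")
print(texts, golds)
```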

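Dataset

The Multilingual large summarization dataset consists of 10 sub-datasets, mainly based on news and daily mails. Training used the entire training set and 72% of the validation set.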

Train set:        3 464 563 docs
Validation set:     121 260 docs

dataset	fragment compression	fragment density	fragment coverage	doc nsent (avg)	doc nwords (avg)	summary nsent (avg)	summary nwords (avg)	documents
cnc	7.388	0.303	0.088	16.121	316.912	3.272	46.805	750K
sumeczech	11.769	0.471	0.115	27.857	415.711	2.765	38.644	1M
cnndm	13.688	2.983	0.538	32.783	676.026	4.134	54.036	300K
xsum	18.378	0.479	0.194	18.607	369.134	1.000	21.127	225K
mlsum/tu	8.666	5.418	0.461	14.271	214.496	1.793	25.675	274K
mlsum/de	24.741	8.235	0.469	32.544	539.653	1.951	23.077	243K
mlsum/fr	24.388	2.688	0.424	24.533	612.080	1.320	26.93	425K
mlsum/es	36.185	3.705	0.510	31.914	746.927	1.142	21.671	291K
mlsum/ru	78.909	1.194	0.246	62.141	948.079	1.012	11.976	27K
cnewsum	20.183	0.000	0.000	16.834	438.271	1.109	21.926	304K

Tokenization

The encoder (input text) is set to 512 tokens and the decoder (summary) to 128 tokens.
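
To make the two limits concrete, here is a minimal sketch that fits a list of token IDs to a fixed length. It assumes the limits are enforced by truncation plus right-padding, which the text does not spell out. `fit_to_length` is a hypothetical helper, and the default `pad_id` of 0 is an illustrative assumption.

```python
from typing import List

ENCODER_MAX_LEN = 512  # input-text limit stated above
DECODER_MAX_LEN = 128  # summary limit stated above


def fit_to_length(token_ids: List[int], max_len: int, pad_id: int = 0) -> List[int]:
    """Cut token_ids to max_len, or right-pad with pad_id up to max_len."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


long_input = list(range(1, 601))   # 600 dummy token IDs, longer than the limit
short_summary = [7, 8, 9]          # 3 dummy token IDs, shorter than the limit

print(len(fit_to_length(long_input, ENCODER_MAX_LEN)))
print(len(fit_to_length(short_summary, DECODER_MAX_LEN)))
```

One practical consequence is that any part of a document beyond the first 512 tokens does not reach the encoder, which matters for sub-datasets with long average documents.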

Training

Training is based on cross-entropy loss.
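
For reference, token-level cross-entropy is the negative log-probability the model assigns to each correct target token, averaged over the sequence. The sketch below computes it in plain Python from raw logits. It illustrates the formula only and is not the authors' training code; the function names are hypothetical.

```python
import math
from typing import List


def token_cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-probability of `target` under softmax(logits)."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def sequence_cross_entropy(step_logits: List[List[float]], targets: List[int]) -> float:
    """Mean token-level cross-entropy over one target sequence."""
    if not targets or len(step_logits) != len(targets):
        raise ValueError("need one non-empty logit vector per target token")
    losses = [token_cross_entropy(l, t) for l, t in zip(step_logits, targets)]
    return sum(losses) / len(losses)


# Toy example: 3 decoding steps over a 4-token vocabulary.
logits = [[2.0, 0.5, 0.1, 0.1], [0.2, 1.5, 0.3, 0.0], [0.0, 0.0, 3.0, 0.0]]
print(sequence_cross_entropy(logits, targets=[0, 1, 2]))
```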

Time: 3 days 20 hours
Epochs: 1080K steps = 10 (from 10)
GPUs: 4x NVIDIA A100-SXM4-40GB
eloss: 2.462 - 1.797
tloss: 17.322 - 1.578

ROUGE results per individual dataset test set:

dataset	ROUGE-1 Precision	ROUGE-1 Recall	ROUGE-1 Fscore	ROUGE-2 Precision	ROUGE-2 Recall	ROUGE-2 Fscore	ROUGE-L Precision	ROUGE-L Recall	ROUGE-L Fscore
cnc	30.62	19.83	23.44	9.94	6.52	7.67	22.92	14.92	17.6
sumeczech	27.57	17.6	20.85	8.12	5.23	6.17	20.84	13.38	15.81
cnndm	43.83	37.73	39.34	20.81	17.82	18.6	31.8	27.42	28.55
xsum	41.63	30.54	34.56	16.13	11.76	13.33	33.65	24.74	27.97
mlsum-tu	54.4	43.29	46.2	38.78	31.31	33.23	48.18	38.44	41
mlsum-de	47.94	44.14	45.11	36.42	35.24	35.42	44.43	41.42	42.16
mlsum-fr	35.26	25.96	28.98	16.72	12.35	13.75	28.06	20.75	23.12
mlsum-es	33.37	24.84	27.52	13.29	10.05	11.05	27.63	20.69	22.87
mlsum-ru	0.79	0.66	0.66	0.26	0.2	0.22	0.79	0.66	0.65
cnewsum	24.49	24.38	23.23	6.48	6.7	6.24	24.18	24.04	22.91
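
As a reminder of what the ROUGE-1 and ROUGE-2 columns measure, the sketch below computes clipped n-gram overlap between one candidate summary and one reference. It is a simplified illustration: whitespace tokenization and lowercasing are assumptions, the scoring tool and settings behind the table are not given here, and ROUGE-L (based on the longest common subsequence) is not covered. The sketch returns values in the 0 to 1 range, whereas the table uses a 0 to 100 scale.

```python
from collections import Counter
from typing import Dict, List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int = 1) -> Dict[str, float]:
    """Precision, recall and F-score of clipped n-gram overlap (simplified ROUGE-N)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "fscore": fscore}


print(rouge_n("the cat lay on the mat", "the cat sat on the mat", n=1))
print(rouge_n("the cat lay on the mat", "the cat sat on the mat", n=2))
```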

Usage

soon

Author:

AI Center FEE CTU

Dataset size:

2.18 GB