Model:
Voicelab/vlt5-base-keywords
Our vlT5 model is a keyword generation model based on an encoder-decoder architecture using the Transformer blocks proposed by Google ( https://huggingface.co/t5-base ). vlT5 was trained on a corpus of scientific articles to predict a given set of keyphrases from the concatenation of an article's abstract and title. Based on the abstract alone, it generates precise, although not always complete, keyphrases that describe the content of the article.
Keywords generated with vlT5-base-keywords: encoder-decoder architecture, keyword generation
Results on the demo model (different generation method, one model per language):
Our vlT5 model is a keyword generation model based on an encoder-decoder architecture using the Transformer blocks proposed by Google ( https://huggingface.co/t5-base ). vlT5 was trained on a corpus of scientific articles to predict a given set of keyphrases from the concatenation of an article's abstract and title. Based on the abstract alone, it generates precise, although not always complete, keyphrases that describe the content of the article.
Keywords generated with vlT5-base-keywords: encoder-decoder architecture, vlT5, keyword generation, scientific articles corpus
The biggest advantage of the vlT5 model is its transferability: it works well across all domains and types of text. The downside is that the text length and the number of keywords mirror the training data: a chunk of text about the length of an abstract yields roughly 3 to 5 keywords. The model works both extractively and abstractively. Longer texts must be split into smaller chunks and passed to the model chunk by chunk.
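The card does not prescribe a particular chunking strategy, so the following is only a minimal sketch of one way to do it: the 512-subword budget, the naive sentence split, and the comma-based merging of the decoded keywords are assumptions made for illustration, not part of the released pipeline.

```python
# Minimal sketch of chunking a long document before keyword generation.
# The 512-subword budget, the naive sentence split and the comma-based
# merging of keywords are assumptions, not part of the released pipeline.
from transformers import T5Tokenizer, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("Voicelab/vlt5-base-keywords")
tokenizer = T5Tokenizer.from_pretrained("Voicelab/vlt5-base-keywords")

def split_into_chunks(text, max_tokens=512):
    """Greedily pack sentences into chunks of at most max_tokens subwords."""
    sentences = text.split(". ")
    chunks, current = [], ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        if current and len(tokenizer.tokenize(candidate)) > max_tokens:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

def keywords_for_long_text(text):
    """Generate keywords for each chunk and merge the deduplicated results."""
    keywords = []
    for chunk in split_into_chunks(text):
        input_ids = tokenizer(
            "Keywords: " + chunk, return_tensors="pt", truncation=True
        ).input_ids
        output = model.generate(input_ids, no_repeat_ngram_size=3, num_beams=4)
        decoded = tokenizer.decode(output[0], skip_special_tokens=True)
        keywords.extend(k.strip() for k in decoded.split(","))
    return sorted(set(keywords))
```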
The model was trained on the POSMAC corpus. The Polish Open Science Metadata Corpus (POSMAC) is a collection of 216,214 abstracts of scientific publications compiled as part of the CURLICAT project.
| Domains | Documents | With keywords |
|---|---|---|
| Engineering and technical sciences | 58 974 | 57 165 |
| Social sciences | 58 166 | 41 799 |
| Agricultural sciences | 29 811 | 15 492 |
| Humanities | 22 755 | 11 497 |
| Exact and natural sciences | 13 579 | 9 185 |
| Humanities, Social sciences | 12 809 | 7 063 |
| Medical and health sciences | 6 030 | 3 913 |
| Medical and health sciences, Social sciences | 828 | 571 |
| Humanities, Medical and health sciences, Social sciences | 601 | 455 |
| Engineering and technical sciences, Humanities | 312 | 312 |
As in the original plT5 implementation, the training dataset was tokenized into subwords with a sentencepiece unigram model using a vocabulary of 50,000 tokens.
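For a quick look at how that subword segmentation behaves in practice, the released tokenizer can be inspected directly; the example sentence below is purely illustrative.

```python
# Inspect the sentencepiece unigram tokenizer shipped with the model;
# the example sentence is illustrative only.
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("Voicelab/vlt5-base-keywords")
print(tokenizer.vocab_size)  # size of the vocabulary reported by the tokenizer
print(tokenizer.tokenize("Keywords: keyword extraction from short texts"))
```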
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("Voicelab/vlt5-base-keywords")
tokenizer = T5Tokenizer.from_pretrained("Voicelab/vlt5-base-keywords")

task_prefix = "Keywords: "
inputs = [
    "Christina Katrakis, who spoke to the BBC from Vorokhta in western Ukraine, relays the account of one family, who say Russian soldiers shot at their vehicles while they were leaving their village near Chernobyl in northern Ukraine. She says the cars had white flags and signs saying they were carrying children.",
    "Decays the learning rate of each parameter group by gamma every step_size epochs. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_epoch=-1, sets initial lr as lr.",
    "Hello, I'd like to order a pizza with salami topping.",
]

for sample in inputs:
    input_sequences = [task_prefix + sample]
    input_ids = tokenizer(
        input_sequences, return_tensors="pt", truncation=True
    ).input_ids
    output = model.generate(input_ids, no_repeat_ngram_size=3, num_beams=4)
    predicted = tokenizer.decode(output[0], skip_special_tokens=True)
    print(sample, "\n --->", predicted)
```
Our results show that the best generation results are obtained with no_repeat_ngram_size=3 and num_beams=4.
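If you want to see how sensitive the output is to these two decoding settings, a small sweep along the lines of the sketch below can help; the parameter grid and the example text are illustrative assumptions, not the authors' tuning procedure.

```python
# Illustrative sweep over the two decoding settings mentioned above;
# the parameter grid and the example text are assumptions for demonstration.
from transformers import T5Tokenizer, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("Voicelab/vlt5-base-keywords")
tokenizer = T5Tokenizer.from_pretrained("Voicelab/vlt5-base-keywords")

sample = (
    "Keywords: Decays the learning rate of each parameter group by gamma "
    "every step_size epochs."
)
input_ids = tokenizer(sample, return_tensors="pt", truncation=True).input_ids

for num_beams in (1, 4):
    for no_repeat_ngram_size in (0, 3):  # 0 disables the n-gram repetition constraint
        output = model.generate(
            input_ids,
            num_beams=num_beams,
            no_repeat_ngram_size=no_repeat_ngram_size,
        )
        decoded = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f"num_beams={num_beams}, no_repeat_ngram_size={no_repeat_ngram_size}: {decoded}")
```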
| Method | Rank | Micro P | Micro R | Micro F1 | Macro P | Macro R | Macro F1 |
|---|---|---|---|---|---|---|---|
| extremeText | 1 | 0.175 | 0.038 | 0.063 | 0.007 | 0.004 | 0.005 |
| | 3 | 0.117 | 0.077 | 0.093 | 0.011 | 0.011 | 0.011 |
| | 5 | 0.090 | 0.099 | 0.094 | 0.013 | 0.016 | 0.015 |
| | 10 | 0.060 | 0.131 | 0.082 | 0.015 | 0.025 | 0.019 |
| vlT5kw | 1 | 0.345 | 0.076 | 0.124 | 0.054 | 0.047 | 0.050 |
| | 3 | 0.328 | 0.212 | 0.257 | 0.133 | 0.127 | 0.129 |
| | 5 | 0.318 | 0.237 | 0.271 | 0.143 | 0.140 | 0.141 |
| KeyBERT | 1 | 0.030 | 0.007 | 0.011 | 0.004 | 0.003 | 0.003 |
| | 3 | 0.015 | 0.010 | 0.012 | 0.006 | 0.004 | 0.005 |
| | 5 | 0.011 | 0.012 | 0.011 | 0.006 | 0.005 | 0.005 |
| TermoPL | 1 | 0.118 | 0.026 | 0.043 | 0.004 | 0.003 | 0.003 |
| | 3 | 0.070 | 0.046 | 0.056 | 0.006 | 0.005 | 0.006 |
| | 5 | 0.051 | 0.056 | 0.053 | 0.007 | 0.007 | 0.007 |
| | all | 0.025 | 0.339 | 0.047 | 0.017 | 0.030 | 0.022 |
| extremeText | 1 | 0.210 | 0.077 | 0.112 | 0.037 | 0.017 | 0.023 |
| | 3 | 0.139 | 0.152 | 0.145 | 0.045 | 0.042 | 0.043 |
| | 5 | 0.107 | 0.196 | 0.139 | 0.049 | 0.063 | 0.055 |
| | 10 | 0.072 | 0.262 | 0.112 | 0.041 | 0.098 | 0.058 |
| vlT5kw | 1 | 0.377 | 0.138 | 0.202 | 0.119 | 0.071 | 0.089 |
| | 3 | 0.361 | 0.301 | 0.328 | 0.185 | 0.147 | 0.164 |
| | 5 | 0.357 | 0.316 | 0.335 | 0.188 | 0.153 | 0.169 |
| KeyBERT | 1 | 0.018 | 0.007 | 0.010 | 0.003 | 0.001 | 0.001 |
| | 3 | 0.009 | 0.010 | 0.009 | 0.004 | 0.001 | 0.002 |
| | 5 | 0.007 | 0.012 | 0.009 | 0.004 | 0.001 | 0.002 |
| TermoPL | 1 | 0.076 | 0.028 | 0.041 | 0.002 | 0.001 | 0.001 |
| | 3 | 0.046 | 0.051 | 0.048 | 0.003 | 0.001 | 0.002 |
| | 5 | 0.033 | 0.061 | 0.043 | 0.003 | 0.001 | 0.002 |
| | all | 0.021 | 0.457 | 0.040 | 0.004 | 0.008 | 0.005 |
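For reference, P, R and F1 in the table are precision, recall and F1-score over predicted versus gold keyword sets. The sketch below shows one common way to compute micro- and macro-averaged scores; exact string matching and per-document macro averaging are assumptions, and the matching and normalization rules used in the paper may differ.

```python
# Minimal sketch of set-based precision, recall and F1 for keyword prediction.
# Exact string matching is assumed, and "macro" is averaged per document here;
# the matching, normalization and averaging rules used in the paper may differ.
def prf(pred, gold):
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def evaluate(predictions, references):
    per_doc, tp, n_pred, n_gold = [], 0, 0, 0
    for pred, gold in zip(predictions, references):
        pred, gold = set(pred), set(gold)
        per_doc.append(prf(pred, gold))
        tp += len(pred & gold)
        n_pred += len(pred)
        n_gold += len(gold)
    # Micro-averaging: pool true positives and counts over all documents.
    micro_p = tp / n_pred if n_pred else 0.0
    micro_r = tp / n_gold if n_gold else 0.0
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r) if micro_p + micro_r else 0.0
    # Macro-averaging (per document here): mean of the per-document scores.
    macro = tuple(sum(values) / len(per_doc) for values in zip(*per_doc))
    return {"micro": (micro_p, micro_r, micro_f1), "macro": macro}

# Example usage with toy data:
preds = [["keyword generation", "t5"], ["learning rate"]]
golds = [["keyword generation", "transformers"], ["learning rate", "scheduler"]]
print(evaluate(preds, golds))
```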
CC BY 4.0
If you use this model, please cite the following paper: Pęzik, P., Mikołajczyk, A., Wawrzyński, A., Żarnecki, F., Nitoń, B., Ogrodniczuk, M. (2023). Transferable Keyword Extraction and Generation with Text-to-Text Language Models. In: Mikyška, J., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2023. ICCS 2023. Lecture Notes in Computer Science, vol 14074. Springer, Cham. https://doi.org/10.1007/978-3-031-36021-3_42 or Piotr Pęzik, Agnieszka Mikołajczyk-Bareła, Adam Wawrzyński, Bartłomiej Nitoń, Maciej Ogrodniczuk, Keyword Extraction from Short Texts with a Text-To-Text Transfer Transformer, ACIIDS 2022.
The model was trained by the NLP research team at Voicelab.ai.
You can contact us here.