

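Our vlT5 model is a keyword generation model based on an encoder-decoder architecture that uses the Transformer blocks introduced by Google (https://huggingface.co/t5-base). vlT5 was trained on a corpus of scientific articles to predict a given set of keywords from the concatenation of an article's abstract and title. Working from the abstract alone, it generates keywords that describe the article's content precisely, though not always completely.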
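
The input format described above can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing code: the `build_input` helper and the single-space separator between title and abstract are assumptions, while the `Keywords: ` prefix is the one used in the usage example in this card.

```python
def build_input(title: str, abstract: str, task_prefix: str = "Keywords: ") -> str:
    """Concatenate title and abstract into one prompt string.

    Hypothetical helper: joining with a single space is an assumption,
    not a documented part of the vlT5 training pipeline.
    """
    parts = [title.strip(), abstract.strip()]
    # Skip empty parts so a missing title does not leave a stray space.
    text = " ".join(p for p in parts if p)
    return task_prefix + text


prompt = build_input(
    "Keyword Extraction from Short Texts",
    "We evaluate a text-to-text model on keyword generation.",
)
print(prompt)
```

The resulting string can be passed to the tokenizer in place of `task_prefix + sample`.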










| Domains | Documents | With keywords |
|---|---|---|
| Engineering and technical sciences | 58 974 | 57 165 |
| Social sciences | 58 166 | 41 799 |
| Agricultural sciences | 29 811 | 15 492 |
| Humanities | 22 755 | 11 497 |
| Exact and natural sciences | 13 579 | 9 185 |
| Humanities, Social sciences | 12 809 | 7 063 |
| Medical and health sciences | 6 030 | 3 913 |
| Medical and health sciences, Social sciences | 828 | 571 |
| Humanities, Medical and health sciences, Social sciences | 601 | 455 |
| Engineering and technical sciences, Humanities | 312 | 312 |


As in the original plT5 implementation, the training dataset was tokenized into subwords with a sentencepiece unigram model that has a vocabulary of 50,000 tokens.
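
To illustrate what unigram tokenization does at inference time, here is a toy sketch of the segmentation step: given a vocabulary of pieces with probabilities, pick the split with the highest total log-probability using dynamic programming. The `segment` function, the vocabulary, and its probabilities are invented for illustration; the real sentencepiece model also handles text normalization and learns its 50,000 pieces from data, neither of which is shown here.

```python
import math
from typing import Dict, List


def segment(word: str, log_probs: Dict[str, float]) -> List[str]:
    """Return the highest-scoring split of `word` under a unigram model."""
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs:
                score = best[start] + log_probs[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        raise ValueError(f"cannot segment {word!r} with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    i = n
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return pieces[::-1]


# Toy vocabulary (assumed values, not taken from the vlT5 tokenizer).
toy_vocab = {p: math.log(q) for p, q in {
    "key": 0.2, "word": 0.2, "words": 0.05, "s": 0.1,
    "k": 0.01, "e": 0.01, "y": 0.01, "w": 0.01,
    "o": 0.01, "r": 0.01, "d": 0.01,
}.items()}

print(segment("keywords", toy_vocab))  # ['key', 'words']
```

Here `key` + `words` (0.2 × 0.05) beats `key` + `word` + `s` (0.2 × 0.2 × 0.1), so the two-piece split wins.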


from transformers import T5Tokenizer, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("Voicelab/vlt5-base-keywords")
tokenizer = T5Tokenizer.from_pretrained("Voicelab/vlt5-base-keywords")

task_prefix = "Keywords: "
inputs = [
    "Christina Katrakis, who spoke to the BBC from Vorokhta in western Ukraine, relays the account of one family, who say Russian soldiers shot at their vehicles while they were leaving their village near Chernobyl in northern Ukraine. She says the cars had white flags and signs saying they were carrying children.",
    "Decays the learning rate of each parameter group by gamma every step_size epochs. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_epoch=-1, sets initial lr as lr.",
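    "Hello, I'd like to order a pizza with salami topping.",
]

for sample in inputs:
    # Prepend the task prefix before tokenizing.
    input_sequences = [task_prefix + sample]
    input_ids = tokenizer(
        input_sequences, return_tensors="pt", truncation=True
    ).input_ids
    # Beam search with 4 beams; no_repeat_ngram_size=3 blocks repeated trigrams.
    output = model.generate(input_ids, no_repeat_ngram_size=3, num_beams=4)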
    predicted = tokenizer.decode(output[0], skip_special_tokens=True)
    print(sample, "\n --->", predicted)
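
The decoded prediction is a single string. Assuming the model separates keywords with commas (an assumption about the output format, which this card does not state), a hypothetical `split_keywords` helper could turn it into a clean list:

```python
from typing import List


def split_keywords(predicted: str, separator: str = ",") -> List[str]:
    """Split a decoded prediction into unique, stripped keywords.

    Hypothetical post-processing helper; the comma separator is an assumption.
    """
    seen = set()
    keywords = []
    for raw in predicted.split(separator):
        kw = raw.strip()
        key = kw.lower()  # compare case-insensitively when removing duplicates
        if kw and key not in seen:
            seen.add(key)
            keywords.append(kw)
    return keywords


print(split_keywords("learning rate, scheduler, Learning rate, "))
# ['learning rate', 'scheduler']
```

Order is preserved, so the first keywords in the list are the first ones the model generated.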




| Method | Rank | Micro P | Micro R | Micro F1 | Macro P | Macro R | Macro F1 |
|---|---|---|---|---|---|---|---|
| extremeText | 1 | 0.175 | 0.038 | 0.063 | 0.007 | 0.004 | 0.005 |
| | 3 | 0.117 | 0.077 | 0.093 | 0.011 | 0.011 | 0.011 |
| | 5 | 0.090 | 0.099 | 0.094 | 0.013 | 0.016 | 0.015 |
| | 10 | 0.060 | 0.131 | 0.082 | 0.015 | 0.025 | 0.019 |
| vlT5kw | 1 | 0.345 | 0.076 | 0.124 | 0.054 | 0.047 | 0.050 |
| | 3 | 0.328 | 0.212 | 0.257 | 0.133 | 0.127 | 0.129 |
| | 5 | 0.318 | 0.237 | 0.271 | 0.143 | 0.140 | 0.141 |
| KeyBERT | 1 | 0.030 | 0.007 | 0.011 | 0.004 | 0.003 | 0.003 |
| | 3 | 0.015 | 0.010 | 0.012 | 0.006 | 0.004 | 0.005 |
| | 5 | 0.011 | 0.012 | 0.011 | 0.006 | 0.005 | 0.005 |
| TermoPL | 1 | 0.118 | 0.026 | 0.043 | 0.004 | 0.003 | 0.003 |
| | 3 | 0.070 | 0.046 | 0.056 | 0.006 | 0.005 | 0.006 |
| | 5 | 0.051 | 0.056 | 0.053 | 0.007 | 0.007 | 0.007 |
| | all | 0.025 | 0.339 | 0.047 | 0.017 | 0.030 | 0.022 |
| extremeText | 1 | 0.210 | 0.077 | 0.112 | 0.037 | 0.017 | 0.023 |
| | 3 | 0.139 | 0.152 | 0.145 | 0.045 | 0.042 | 0.043 |
| | 5 | 0.107 | 0.196 | 0.139 | 0.049 | 0.063 | 0.055 |
| | 10 | 0.072 | 0.262 | 0.112 | 0.041 | 0.098 | 0.058 |
| vlT5kw | 1 | 0.377 | 0.138 | 0.202 | 0.119 | 0.071 | 0.089 |
| | 3 | 0.361 | 0.301 | 0.328 | 0.185 | 0.147 | 0.164 |
| | 5 | 0.357 | 0.316 | 0.335 | 0.188 | 0.153 | 0.169 |
| KeyBERT | 1 | 0.018 | 0.007 | 0.010 | 0.003 | 0.001 | 0.001 |
| | 3 | 0.009 | 0.010 | 0.009 | 0.004 | 0.001 | 0.002 |
| | 5 | 0.007 | 0.012 | 0.009 | 0.004 | 0.001 | 0.002 |
| TermoPL | 1 | 0.076 | 0.028 | 0.041 | 0.002 | 0.001 | 0.001 |
| | 3 | 0.046 | 0.051 | 0.048 | 0.003 | 0.001 | 0.002 |
| | 5 | 0.033 | 0.061 | 0.043 | 0.003 | 0.001 | 0.002 |
| | all | 0.021 | 0.457 | 0.040 | 0.004 | 0.008 | 0.005 |
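
For readers who want to compute comparable numbers on their own data, the sketch below shows one common way to get micro-averaged precision, recall, and F1 at a rank cutoff k. It is not the authors' evaluation script: the `micro_prf_at_k` function and exact string matching are assumptions, and this card does not say whether keywords were normalized or lemmatized before comparison. Macro averaging is not shown.

```python
from typing import List, Sequence, Tuple


def micro_prf_at_k(
    predictions: Sequence[List[str]],
    gold: Sequence[List[str]],
    k: int,
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over the top-k predictions.

    Counts are pooled over all documents before dividing.
    Matching is exact string equality (an assumption).
    """
    tp = n_pred = n_gold = 0
    for pred, ref in zip(predictions, gold):
        top_k = pred[:k]
        ref_set = set(ref)
        tp += len(set(top_k) & ref_set)  # correct keywords in the top k
        n_pred += len(top_k)
        n_gold += len(ref_set)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Two toy documents with ranked predictions and reference keywords.
preds = [["a", "b", "c"], ["x", "y"]]
refs = [["a", "c", "d"], ["y"]]
print(micro_prf_at_k(preds, refs, k=2))  # (0.5, 0.5, 0.5)
```

Raising k typically trades precision for recall, which matches the pattern visible across the rank rows of the table.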


CC BY 4.0


If you use this model, please cite the following papers:

Pęzik, P., Mikołajczyk, A., Wawrzyński, A., Żarnecki, F., Nitoń, B., Ogrodniczuk, M. (2023). Transferable Keyword Extraction and Generation with Text-to-Text Language Models. In: Mikyška, J., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2023. ICCS 2023. Lecture Notes in Computer Science, vol 14074. Springer, Cham. https://doi.org/10.1007/978-3-031-36021-3_42

Pęzik, P., Mikołajczyk-Bareła, A., Wawrzyński, A., Nitoń, B., Ogrodniczuk, M. (2022). Keyword Extraction from Short Texts with a Text-To-Text Transfer Transformer. ACIIDS 2022.



You can contact us here.