
Keyword Extraction from Short Texts with T5

Our vlT5 model is a keyword generation model based on an encoder-decoder architecture using the Transformer blocks proposed by Google ( https://huggingface.co/t5-base ). vlT5 was trained on a corpus of scientific articles to predict a given set of keyphrases from the concatenation of an article's abstract and title. Based on the abstract alone, it generates precise, though not always complete, keyphrases that describe the content of the article.

Keywords generated with vlT5-base-keywords: encoder-decoder architecture, keyword generation

Results on the demo sample (different generation methods, one model per language):

Our vlT5 model is a keyword generation model based on an encoder-decoder architecture using the Transformer blocks proposed by Google ( https://huggingface.co/t5-base ). vlT5 was trained on a corpus of scientific articles to predict a given set of keyphrases from the concatenation of an article's abstract and title. Based on the abstract alone, it generates precise, though not always complete, keyphrases that describe the content of the article.

Keywords generated with vlT5-base-keywords: encoder-decoder architecture, vlT5, keyword generation, scientific articles corpus

vlT5

The biggest advantage of the vlT5 model is its transferability: it works across all domains and types of text. The downside is that the text length and the number of keywords mirror the training data: a text of roughly abstract length yields about 3 to 5 keywords. The model works both extractively and abstractively. Longer texts must be split into smaller chunks and passed to the model one chunk at a time, as in the sketch below.
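
A minimal sketch of one possible chunking strategy. The helper name keywords_for_long_text and the chunk size of ~200 words (chosen to approximate abstract length) are assumptions for illustration, not part of the released model:

from transformers import T5Tokenizer, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("Voicelab/vlt5-base-keywords")
tokenizer = T5Tokenizer.from_pretrained("Voicelab/vlt5-base-keywords")

def keywords_for_long_text(text, chunk_words=200):
    # Naive whitespace chunking; ~200 words approximates the abstract-sized
    # inputs the model saw during training (an assumption, tune as needed).
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    keywords = []
    for chunk in chunks:
        input_ids = tokenizer("Keywords: " + chunk,
                              return_tensors="pt", truncation=True).input_ids
        output = model.generate(input_ids, no_repeat_ngram_size=3, num_beams=4)
        decoded = tokenizer.decode(output[0], skip_special_tokens=True)
        # The model emits keyphrases as one comma-separated string.
        keywords.extend(k.strip() for k in decoded.split(","))
    return sorted(set(keywords))  # deduplicate across chunks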

Overview

Corpus

The model was trained on the POSMAC corpus. The Polish Open Science Metadata Corpus (POSMAC) is a collection of abstracts of 216,214 scientific publications compiled in the CURLICAT project.

Domains                                                     Documents   With keywords
Engineering and technical sciences                             58 974          57 165
Social sciences                                                58 166          41 799
Agricultural sciences                                          29 811          15 492
Humanities                                                     22 755          11 497
Exact and natural sciences                                     13 579           9 185
Humanities, Social sciences                                    12 809           7 063
Medical and health sciences                                     6 030           3 913
Medical and health sciences, Social sciences                      828             571
Humanities, Medical and health sciences, Social sciences          601             455
Engineering and technical sciences, Humanities                    312             312

Tokenizer

As in the original plT5 implementation, the training dataset was tokenized into subwords using a sentencepiece unigram model with a vocabulary of 50,000 tokens.
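
A quick way to inspect this tokenizer in practice; the sample phrase below is arbitrary:

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("Voicelab/vlt5-base-keywords")

print(tokenizer.vocab_size)  # sentencepiece unigram vocabulary (~50k tokens)
# Subword segmentation of an arbitrary phrase; words outside the
# vocabulary are split into smaller sentencepiece units.
print(tokenizer.tokenize("keyword extraction from short texts"))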

Usage

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the fine-tuned checkpoint and its matching sentencepiece tokenizer.
model = T5ForConditionalGeneration.from_pretrained("Voicelab/vlt5-base-keywords")
tokenizer = T5Tokenizer.from_pretrained("Voicelab/vlt5-base-keywords")

# The model expects the "Keywords: " task prefix in front of every input.
task_prefix = "Keywords: "
inputs = [
    "Christina Katrakis, who spoke to the BBC from Vorokhta in western Ukraine, relays the account of one family, who say Russian soldiers shot at their vehicles while they were leaving their village near Chernobyl in northern Ukraine. She says the cars had white flags and signs saying they were carrying children.",
    "Decays the learning rate of each parameter group by gamma every step_size epochs. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_epoch=-1, sets initial lr as lr.",
    "Hello, I'd like to order a pizza with salami topping.",
]

for sample in inputs:
    input_sequences = [task_prefix + sample]
    input_ids = tokenizer(
        input_sequences, return_tensors="pt", truncation=True
    ).input_ids
    # Beam search with an n-gram repetition penalty (see Inference below).
    output = model.generate(input_ids, no_repeat_ngram_size=3, num_beams=4)
    predicted = tokenizer.decode(output[0], skip_special_tokens=True)
    print(sample, "\n --->", predicted)

Inference

Our results show that the settings no_repeat_ngram_size=3, num_beams=4 give the best generation results.
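
As an illustration, the snippet below compares default greedy decoding with the recommended beam-search settings on a single input, reusing the model, tokenizer, and task_prefix from the Usage section above; the sample text is arbitrary:

text = task_prefix + "Decays the learning rate of each parameter group by gamma every step_size epochs."
input_ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids

greedy = model.generate(input_ids)  # default greedy decoding
beam = model.generate(input_ids, no_repeat_ngram_size=3, num_beams=4)

print("greedy:", tokenizer.decode(greedy[0], skip_special_tokens=True))
print("beam:  ", tokenizer.decode(beam[0], skip_special_tokens=True))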

Results

Method       Rank   Micro P   Micro R   Micro F1   Macro P   Macro R   Macro F1
extremeText  1      0.175     0.038     0.063      0.007     0.004     0.005
             3      0.117     0.077     0.093      0.011     0.011     0.011
             5      0.090     0.099     0.094      0.013     0.016     0.015
             10     0.060     0.131     0.082      0.015     0.025     0.019
vlT5kw       1      0.345     0.076     0.124      0.054     0.047     0.050
             3      0.328     0.212     0.257      0.133     0.127     0.129
             5      0.318     0.237     0.271      0.143     0.140     0.141
KeyBERT      1      0.030     0.007     0.011      0.004     0.003     0.003
             3      0.015     0.010     0.012      0.006     0.004     0.005
             5      0.011     0.012     0.011      0.006     0.005     0.005
TermoPL      1      0.118     0.026     0.043      0.004     0.003     0.003
             3      0.070     0.046     0.056      0.006     0.005     0.006
             5      0.051     0.056     0.053      0.007     0.007     0.007
             all    0.025     0.339     0.047      0.017     0.030     0.022

extremeText  1      0.210     0.077     0.112      0.037     0.017     0.023
             3      0.139     0.152     0.145      0.045     0.042     0.043
             5      0.107     0.196     0.139      0.049     0.063     0.055
             10     0.072     0.262     0.112      0.041     0.098     0.058
vlT5kw       1      0.377     0.138     0.202      0.119     0.071     0.089
             3      0.361     0.301     0.328      0.185     0.147     0.164
             5      0.357     0.316     0.335      0.188     0.153     0.169
KeyBERT      1      0.018     0.007     0.010      0.003     0.001     0.001
             3      0.009     0.010     0.009      0.004     0.001     0.002
             5      0.007     0.012     0.009      0.004     0.001     0.002
TermoPL      1      0.076     0.028     0.041      0.002     0.001     0.001
             3      0.046     0.051     0.048      0.003     0.001     0.002
             5      0.033     0.061     0.043      0.003     0.001     0.002
             all    0.021     0.457     0.040      0.004     0.008     0.005
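
As a reading aid, the sketch below shows one common way micro- and macro-averaged P/R/F1 are computed from per-document keyword sets. Whether the evaluation above macro-averages over documents or over labels is not stated here, so the per-document averaging (and the toy data) is an assumption for illustration only:

def prf(tp, fp, fn):
    # Precision/recall/F1 from raw counts; guard against empty denominators.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def micro_macro(gold, pred):
    # gold, pred: lists of keyword sets, one pair per document.
    counts = [(len(g & p), len(p - g), len(g - p)) for g, p in zip(gold, pred)]
    micro = prf(*map(sum, zip(*counts)))                  # pool counts first
    per_doc = [prf(tp, fp, fn) for tp, fp, fn in counts]  # score each document
    macro = tuple(sum(x) / len(x) for x in zip(*per_doc))
    return micro, macro

gold = [{"keyword generation", "t5"}, {"beam search"}]
pred = [{"keyword generation"}, {"beam search", "decoding"}]
print(micro_macro(gold, pred))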

License

CC BY 4.0

Citation

If you use this model, please cite the following papers:

Pęzik, P., Mikołajczyk, A., Wawrzyński, A., Żarnecki, F., Nitoń, B., Ogrodniczuk, M. (2023). Transferable Keyword Extraction and Generation with Text-to-Text Language Models. In: Mikyška, J., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2023. Lecture Notes in Computer Science, vol 14074. Springer, Cham. https://doi.org/10.1007/978-3-031-36021-3_42

Pęzik, P., Mikołajczyk-Bareła, A., Wawrzyński, A., Nitoń, B., Ogrodniczuk, M. (2022). Keyword Extraction from Short Texts with a Text-To-Text Transfer Transformer. ACIIDS 2022.

Authors

The model was trained by the NLP research team at Voicelab.ai.

You can contact us here.