
Keyword Extraction from Short Texts with T5

Our vlT5 model is a keyword generation model based on an encoder-decoder architecture using the Transformer blocks proposed by Google ( https://huggingface.co/t5-base ). vlT5 was trained on a corpus of scientific articles to predict a given set of keyphrases from the concatenation of an article's abstract and title. Based on the abstract alone, it generates precise, though not always complete, keyphrases that describe the content of the article.

Keywords generated with vlT5-base-keywords: encoder-decoder architecture, keyword generation

Results on the demo sample (different generation methods, one model per language):

Our vlT5 model is a keyword generation model based on an encoder-decoder architecture using the Transformer blocks proposed by Google ( https://huggingface.co/t5-base ). vlT5 was trained on a corpus of scientific articles to predict a given set of keyphrases from the concatenation of an article's abstract and title. Based on the abstract alone, it generates precise, though not always complete, keyphrases that describe the content of the article.

Keywords generated with vlT5-base-keywords: encoder-decoder architecture, vlT5, keyword generation, scientific articles corpus

vlT5

The biggest advantage of the vlT5 model is its transferability: it works across all domains and types of text. The downside is that the text length and the number of keywords mirror the training data: a text of roughly abstract length yields about 3 to 5 keywords. The model works both extractively and abstractively. Longer texts must be split into smaller chunks and passed to the model one chunk at a time, as in the sketch below.
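
A minimal sketch of one possible chunking strategy. The helper name keywords_for_long_text and the chunk size of ~200 words (chosen to approximate abstract length) are assumptions for illustration, not part of the released model:

from transformers import T5Tokenizer, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("Voicelab/vlt5-base-keywords")
tokenizer = T5Tokenizer.from_pretrained("Voicelab/vlt5-base-keywords")

def keywords_for_long_text(text, chunk_words=200):
    # Naive whitespace chunking; ~200 words approximates the abstract-sized
    # inputs the model saw during training (an assumption, tune as needed).
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    keywords = []
    for chunk in chunks:
        input_ids = tokenizer("Keywords: " + chunk,
                              return_tensors="pt", truncation=True).input_ids
        output = model.generate(input_ids, no_repeat_ngram_size=3, num_beams=4)
        decoded = tokenizer.decode(output[0], skip_special_tokens=True)
        # The model emits keyphrases as one comma-separated string.
        keywords.extend(k.strip() for k in decoded.split(","))
    return sorted(set(keywords))  # deduplicate across chunks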

Overview

Corpus

The model was trained on the POSMAC corpus. The Polish Open Science Metadata Corpus (POSMAC) is a collection of abstracts of 216,214 scientific publications compiled in the CURLICAT project.

Domains                                                     Documents   With keywords
Engineering and technical sciences                             58 974          57 165
Social sciences                                                58 166          41 799
Agricultural sciences                                          29 811          15 492
Humanities                                                     22 755          11 497
Exact and natural sciences                                     13 579           9 185
Humanities, Social sciences                                    12 809           7 063
Medical and health sciences                                     6 030           3 913
Medical and health sciences, Social sciences                      828             571
Humanities, Medical and health sciences, Social sciences          601             455
Engineering and technical sciences, Humanities                    312             312

Tokenizer

As in the original plT5 implementation, the training dataset was tokenized into subwords using a sentencepiece unigram model with a vocabulary of 50,000 tokens.
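
A quick way to inspect this tokenizer in practice; the sample phrase below is arbitrary:

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("Voicelab/vlt5-base-keywords")

print(tokenizer.vocab_size)  # sentencepiece unigram vocabulary (~50k tokens)
# Subword segmentation of an arbitrary phrase; words outside the
# vocabulary are split into smaller sentencepiece units.
print(tokenizer.tokenize("keyword extraction from short texts"))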

Usage

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the fine-tuned checkpoint and its matching sentencepiece tokenizer.
model = T5ForConditionalGeneration.from_pretrained("Voicelab/vlt5-base-keywords")
tokenizer = T5Tokenizer.from_pretrained("Voicelab/vlt5-base-keywords")

# The model expects the "Keywords: " task prefix in front of every input.
task_prefix = "Keywords: "
inputs = [
    "Christina Katrakis, who spoke to the BBC from Vorokhta in western Ukraine, relays the account of one family, who say Russian soldiers shot at their vehicles while they were leaving their village near Chernobyl in northern Ukraine. She says the cars had white flags and signs saying they were carrying children.",
    "Decays the learning rate of each parameter group by gamma every step_size epochs. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_epoch=-1, sets initial lr as lr.",
    "Hello, I'd like to order a pizza with salami topping.",
]

for sample in inputs:
    input_sequences = [task_prefix + sample]
    input_ids = tokenizer(
        input_sequences, return_tensors="pt", truncation=True
    ).input_ids
    # Beam search with an n-gram repetition penalty (see Inference below).
    output = model.generate(input_ids, no_repeat_ngram_size=3, num_beams=4)
    predicted = tokenizer.decode(output[0], skip_special_tokens=True)
    print(sample, "\n --->", predicted)

Inference

Our results show that the settings no_repeat_ngram_size=3, num_beams=4 give the best generation results.
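
As an illustration, the snippet below compares default greedy decoding with the recommended beam-search settings on a single input, reusing the model, tokenizer, and task_prefix from the Usage section above; the sample text is arbitrary:

text = task_prefix + "Decays the learning rate of each parameter group by gamma every step_size epochs."
input_ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids

greedy = model.generate(input_ids)  # default greedy decoding
beam = model.generate(input_ids, no_repeat_ngram_size=3, num_beams=4)

print("greedy:", tokenizer.decode(greedy[0], skip_special_tokens=True))
print("beam:  ", tokenizer.decode(beam[0], skip_special_tokens=True))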

Results

Method       Rank   Micro P   Micro R   Micro F1   Macro P   Macro R   Macro F1
extremeText  1      0.175     0.038     0.063      0.007     0.004     0.005
             3      0.117     0.077     0.093      0.011     0.011     0.011
             5      0.090     0.099     0.094      0.013     0.016     0.015
             10     0.060     0.131     0.082      0.015     0.025     0.019
vlT5kw       1      0.345     0.076     0.124      0.054     0.047     0.050
             3      0.328     0.212     0.257      0.133     0.127     0.129
             5      0.318     0.237     0.271      0.143     0.140     0.141
KeyBERT      1      0.030     0.007     0.011      0.004     0.003     0.003
             3      0.015     0.010     0.012      0.006     0.004     0.005
             5      0.011     0.012     0.011      0.006     0.005     0.005
TermoPL      1      0.118     0.026     0.043      0.004     0.003     0.003
             3      0.070     0.046     0.056      0.006     0.005     0.006
             5      0.051     0.056     0.053      0.007     0.007     0.007
             all    0.025     0.339     0.047      0.017     0.030     0.022

extremeText  1      0.210     0.077     0.112      0.037     0.017     0.023
             3      0.139     0.152     0.145      0.045     0.042     0.043
             5      0.107     0.196     0.139      0.049     0.063     0.055
             10     0.072     0.262     0.112      0.041     0.098     0.058
vlT5kw       1      0.377     0.138     0.202      0.119     0.071     0.089
             3      0.361     0.301     0.328      0.185     0.147     0.164
             5      0.357     0.316     0.335      0.188     0.153     0.169
KeyBERT      1      0.018     0.007     0.010      0.003     0.001     0.001
             3      0.009     0.010     0.009      0.004     0.001     0.002
             5      0.007     0.012     0.009      0.004     0.001     0.002
TermoPL      1      0.076     0.028     0.041      0.002     0.001     0.001
             3      0.046     0.051     0.048      0.003     0.001     0.002
             5      0.033     0.061     0.043      0.003     0.001     0.002
             all    0.021     0.457     0.040      0.004     0.008     0.005
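
As a reading aid, the sketch below shows one common way micro- and macro-averaged P/R/F1 are computed from per-document keyword sets. Whether the evaluation above macro-averages over documents or over labels is not stated here, so the per-document averaging (and the toy data) is an assumption for illustration only:

def prf(tp, fp, fn):
    # Precision/recall/F1 from raw counts; guard against empty denominators.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def micro_macro(gold, pred):
    # gold, pred: lists of keyword sets, one pair per document.
    counts = [(len(g & p), len(p - g), len(g - p)) for g, p in zip(gold, pred)]
    micro = prf(*map(sum, zip(*counts)))                  # pool counts first
    per_doc = [prf(tp, fp, fn) for tp, fp, fn in counts]  # score each document
    macro = tuple(sum(x) / len(x) for x in zip(*per_doc))
    return micro, macro

gold = [{"keyword generation", "t5"}, {"beam search"}]
pred = [{"keyword generation"}, {"beam search", "decoding"}]
print(micro_macro(gold, pred))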

License

CC BY 4.0

Citation

If you use this model, please cite the following papers:

Pęzik, P., Mikołajczyk, A., Wawrzyński, A., Żarnecki, F., Nitoń, B., Ogrodniczuk, M. (2023). Transferable Keyword Extraction and Generation with Text-to-Text Language Models. In: Mikyška, J., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2023. Lecture Notes in Computer Science, vol 14074. Springer, Cham. https://doi.org/10.1007/978-3-031-36021-3_42

Pęzik, P., Mikołajczyk-Bareła, A., Wawrzyński, A., Nitoń, B., Ogrodniczuk, M. (2022). Keyword Extraction from Short Texts with a Text-To-Text Transfer Transformer. ACIIDS 2022.

Authors

The model was trained by the NLP research team at Voicelab.ai.

You can contact us here.