Longformer-base-4096 fine-tuned on SQuAD v2

longformer-base-4096 fine-tuned on SQuAD v2 for the question answering (Q&A) downstream task.

Longformer-base-4096

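Longformer is a Transformer model for long documents.

longformer-base-4096 is a BERT-like model started from the RoBERTa checkpoint and pretrained for masked language modeling (MLM) on long documents. It supports sequences of up to 4,096 tokens.

Longformer uses a combination of sliding-window (local) attention and global attention. Global attention is configured by the user for each task, which lets the model learn task-specific representations.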
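To make the attention pattern concrete, here is a minimal pure-Python sketch, not the library implementation. The function name `longformer_attention_pattern` and its parameters are hypothetical. It assumes a symmetric window of `window // 2` tokens on each side and treats global attention as symmetric: a global token attends to every position, and every position attends to it.

```python
def longformer_attention_pattern(seq_len, window, global_positions=()):
    """Return a seq_len x seq_len boolean matrix.

    Entry [i][j] is True when token i may attend to token j.
    Illustrative helper only; names and defaults are assumptions.
    """
    half = window // 2
    global_set = set(global_positions)
    pattern = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # sliding-window (local) attention: only nearby tokens
            local = abs(i - j) <= half
            # global attention: allowed if either side is a global token
            is_global = i in global_set or j in global_set
            row.append(local or is_global)
        pattern.append(row)
    return pattern


# Position 0 stands in for a token that was given global attention.
pattern = longformer_attention_pattern(seq_len=8, window=2, global_positions=[0])
for row in pattern:
    print("".join("x" if allowed else "." for allowed in row))
```

In question answering, the global positions would typically be the question tokens, so every context token can see the full question while still attending only locally to the rest of the document.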

Details of the downstream task (Q&A) - Dataset

Dataset ID: squad_v2, from HuggingFace/Datasets

Dataset	Split	# samples
squad_v2	train	130319
squad_v2	valid	11873

How to load it from datasets

!pip install datasets
from datasets import load_dataset
dataset = load_dataset('squad_v2')
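SQuAD v2 mixes answerable questions with questions that have no answer in the context. In the `squad_v2` dataset, an unanswerable example has empty `answers["text"]` and `answers["answer_start"]` lists. The sketch below shows how such examples could be counted; the helper names and the two toy records are hypothetical and only mimic the record shape, so no download is needed to run it.

```python
def is_unanswerable(example):
    """True when a squad_v2-style record has no gold answer."""
    return len(example["answers"]["text"]) == 0


def count_unanswerable(examples):
    """Count records without a gold answer in any iterable of records."""
    return sum(1 for example in examples if is_unanswerable(example))


# Hypothetical records that follow the squad_v2 field layout.
toy_examples = [
    {
        "question": "What has Huggingface done?",
        "context": "Huggingface has democratized NLP.",
        "answers": {"text": ["democratized NLP"], "answer_start": [16]},
    },
    {
        "question": "Who founded Huggingface?",
        "context": "Huggingface has democratized NLP.",
        "answers": {"text": [], "answer_start": []},
    },
]

print(count_unanswerable(toy_examples))
```

The same function should work on a loaded split, for example `count_unanswerable(dataset["validation"])`, since iterating a split yields records as dictionaries.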

Learn more about this dataset and others in the Datasets Viewer.

Model fine-tuning

The training script is a slightly modified version of this one

Model in action

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
ckpt = "mrm8488/longformer-base-4096-finetuned-squadv2"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForQuestionAnswering.from_pretrained(ckpt)

text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this."
question = "What has Huggingface done ?"
encoding = tokenizer(question, text, return_tensors="pt")
input_ids = encoding["input_ids"]

# default is local attention everywhere
# the forward method will automatically set global attention on question tokens
attention_mask = encoding["attention_mask"]

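# the model returns an output object; read the logits by attribute
outputs = model(input_ids, attention_mask=attention_mask)
start_scores, end_scores = outputs.start_logits, outputs.end_logits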
all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())

answer_tokens = all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores) + 1]
answer = tokenizer.decode(tokenizer.convert_tokens_to_ids(answer_tokens))

# output => democratized NLP
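The snippet above picks the start and end positions with two independent argmax calls, so the end index can land before the start index and produce an empty slice. A minimal sketch of one common alternative is to score every valid (start, end) pair jointly. The function `best_span` and the `max_answer_len` parameter are hypothetical, and this is not necessarily what the model author or the pipeline does.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return the (start, end) pair with the highest summed score,
    restricted to end >= start and at most max_answer_len tokens."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_scores):
        # only consider ends at or after the start, within the length limit
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Toy scores where the independent argmax would give start=1, end=0.
toy_start = [0.1, 2.0, 0.3, 0.2]
toy_end = [3.0, 0.1, 1.5, 0.2]
print(best_span(toy_start, toy_end))
```

With real model outputs, the scores could be passed as plain lists, for example `start_scores[0].tolist()` and `end_scores[0].tolist()`.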

Usage with HF pipeline

from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline

ckpt = "mrm8488/longformer-base-4096-finetuned-squadv2"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForQuestionAnswering.from_pretrained(ckpt)

qa = pipeline("question-answering", model=model, tokenizer=tokenizer)

text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this."
question = "What has Huggingface done?"

qa({"question": question, "context": text})

If, given the same context, we ask about something that is not in it, the no-answer output will be <s>
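Downstream code usually wants an explicit "no answer" value rather than the literal `<s>` string. The sketch below shows one way to normalize it, assuming the decoded answer arrives as a plain string; `answer_or_none` and `NO_ANSWER_MARKERS` are hypothetical names, and treating an empty string as "no answer" is an added assumption.

```python
# Strings treated as "no answer": the <s> token noted above, plus an
# empty string (an assumption for the case of an empty decoded span).
NO_ANSWER_MARKERS = {"", "<s>"}


def answer_or_none(decoded_answer):
    """Return the stripped answer text, or None when the model signalled
    that the question cannot be answered from the context."""
    cleaned = decoded_answer.strip()
    return None if cleaned in NO_ANSWER_MARKERS else cleaned


print(answer_or_none("<s>"))
print(answer_or_none(" democratized NLP"))
```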

Created by Manuel Romero/@mrm8488 | LinkedIn

Made with ♥ in Spain

Author:

Manuel Romero

Dataset size:

1.11 GB