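Model:
mrm8488/longformer-base-4096-finetuned-squadv2
Fine-tuned on SQuAD v2 for the question answering (Q&A) downstream task.
Longformer is a Transformer model for long documents.
longformer-base-4096 is a BERT-like model started from the RoBERTa checkpoint and pretrained for masked language modeling (MLM) on long documents. It supports sequences of up to 4,096 tokens.
Longformer uses a combination of sliding-window (local) attention and global attention. Global attention is configured by the user according to the task, which allows the model to learn task-specific representations.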
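To make the combination of local and global attention more concrete, the following is a minimal pure-Python sketch of which token pairs are allowed to attend to each other. It is an illustration only, not the library's implementation: the function name `longformer_attention_pattern` and its parameters are hypothetical, and the sketch assumes a symmetric window of `window // 2` tokens on each side plus global tokens that attend to, and are attended by, every position.

```python
def longformer_attention_pattern(seq_len, window, global_positions=()):
    """Return a seq_len x seq_len boolean matrix (list of lists).

    Entry [i][j] is True when token i may attend to token j.
    Hypothetical helper for illustration only.
    """
    half = window // 2
    is_global = set(global_positions)
    pattern = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # Sliding-window (local) attention: neighbours within half a window.
            local = abs(i - j) <= half
            # Global attention: a global token sees everything and is seen by everything.
            glob = i in is_global or j in is_global
            row.append(local or glob)
        pattern.append(row)
    return pattern


# Example: 8 tokens, window of 2, global attention on the first token only.
pattern = longformer_attention_pattern(seq_len=8, window=2, global_positions=[0])
for row in pattern:
    print("".join("x" if allowed else "." for allowed in row))
```

For question answering, the global positions would typically be the question tokens, so that every context token can see the question while most of the attention stays local.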
Dataset ID: squad_v2, from HuggingFace/Datasets
| Dataset | Split | # samples |
|---|---|---|
| squad_v2 | train | 130319 |
| squad_v2 | valid | 11873 |
How to load it from datasets

```python
# In a notebook, install the library first: !pip install datasets
from datasets import load_dataset

dataset = load_dataset('squad_v2')
```
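SQuAD v2 mixes answerable and unanswerable questions. In the Hugging Face `squad_v2` layout, each record carries an `answers` dict with parallel `text` and `answer_start` lists, which are empty when the question has no answer in the context. The sketch below shows one way to tell the two cases apart. The helper names and the two sample records are made up for illustration and are not taken from the dataset.

```python
def is_unanswerable(example):
    """True when a SQuAD v2 style record has no gold answer."""
    return len(example["answers"]["text"]) == 0


def count_unanswerable(examples):
    """Count records without a gold answer."""
    return sum(1 for ex in examples if is_unanswerable(ex))


# Hypothetical records in the squad_v2 field layout (not real dataset rows).
samples = [
    {
        "context": "Huggingface has democratized NLP.",
        "question": "What has Huggingface done?",
        "answers": {"text": ["democratized NLP"], "answer_start": [16]},
    },
    {
        "context": "Huggingface has democratized NLP.",
        "question": "Who founded Huggingface?",
        "answers": {"text": [], "answer_start": []},
    },
]

print(count_unanswerable(samples), "of", len(samples), "samples are unanswerable")
```

The same predicate could be passed to `filter` on a loaded split to keep only the unanswerable questions.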
Learn more about this dataset and others in the Datasets Viewer.
The training script is a slightly modified version of this one.
```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

ckpt = "mrm8488/longformer-base-4096-finetuned-squadv2"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForQuestionAnswering.from_pretrained(ckpt)

text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this."
question = "What has Huggingface done ?"
encoding = tokenizer(question, text, return_tensors="pt")
input_ids = encoding["input_ids"]

# Default is local attention everywhere;
# the forward method will automatically set global attention on question tokens.
attention_mask = encoding["attention_mask"]

# Recent transformers versions return an output object rather than a tuple,
# so the logits are read by attribute instead of being unpacked.
outputs = model(input_ids, attention_mask=attention_mask)
start_scores = outputs.start_logits
end_scores = outputs.end_logits

# Map the highest-scoring start and end positions back to tokens.
all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
answer_tokens = all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores) + 1]
answer = tokenizer.decode(tokenizer.convert_tokens_to_ids(answer_tokens))
# output => democratized NLP
```
```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline

ckpt = "mrm8488/longformer-base-4096-finetuned-squadv2"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForQuestionAnswering.from_pretrained(ckpt)

qa = pipeline("question-answering", model=model, tokenizer=tokenizer)

text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this."
question = "What has Huggingface done?"

qa({"question": question, "context": text})
```
If, given the same context, we ask about something that is not in it, the output for no answer will be `<s>`.
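Since a no-answer prediction surfaces as the `<s>` token, downstream code may want to turn it into an explicit "no answer" value. The sketch below is one possible way to do that on plain Python lists. The function `extract_answer`, the token list, and the score values are all hypothetical and chosen only to illustrate the two cases; it assumes that a predicted span consisting solely of `<s>` means the model found no answer.

```python
def extract_answer(tokens, start_scores, end_scores, no_answer_token="<s>"):
    """Return the predicted answer tokens, or None when there is no answer.

    Hypothetical helper: picks the highest-scoring start and end positions
    and treats a span made only of the no-answer token as "no answer".
    """
    start = max(range(len(start_scores)), key=start_scores.__getitem__)
    end = max(range(len(end_scores)), key=end_scores.__getitem__)
    if end < start:
        # Inconsistent span: treat as no usable answer.
        return None
    span = tokens[start:end + 1]
    if span == [no_answer_token]:
        return None
    return span


# Illustrative tokens and scores (not real model output).
tokens = ["<s>", "What", "done", "</s>", "</s>",
          "Hugging", "face", "democratized", "NLP", "</s>"]

answerable_start = [0.1, 0.0, 0.0, 0.0, 0.0, 0.2, 0.1, 5.0, 0.3, 0.0]
answerable_end = [0.1, 0.0, 0.0, 0.0, 0.0, 0.1, 0.2, 0.4, 6.0, 0.0]
print(extract_answer(tokens, answerable_start, answerable_end))

no_answer_start = [4.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.1, 0.5, 0.3, 0.0]
no_answer_end = [3.5, 0.0, 0.0, 0.0, 0.0, 0.1, 0.2, 0.4, 0.6, 0.0]
print(extract_answer(tokens, no_answer_start, no_answer_end))
```

Returning `None` rather than the raw `<s>` string keeps the no-answer case distinguishable from a real answer in later processing.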
Created by Manuel Romero/@mrm8488 | LinkedIn
Made with ♥ in Spain