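Question Answering

XLM-RoBERTa is a multilingual language model pre-trained on 2.5 TB of filtered CommonCrawl data covering 100 languages. It was introduced in the paper Unsupervised Cross-lingual Representation Learning at Scale by Conneau et al.

The multilingual model XLM-RoBERTa large for QA on various languages has been fine-tuned on a variety of question answering datasets, but not on PQuAD, the largest Persian question answering dataset to date. This second model is the base model that we fine-tune.

Paper introducing the PQuAD dataset: arXiv:2202.06219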
Because of the GPU memory limits on Google Colab, I set the batch size to 4.

    batch_size = 4
    n_epochs = 1
    base_LM_model = "deepset/xlm-roberta-large-squad2"
    max_seq_len = 256
    learning_rate = 3e-5
    evaluation_strategy = "epoch"
    save_strategy = "epoch"
    warmup_ratio = 0.1
    gradient_accumulation_steps = 8
    weight_decay = 0.01

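
To make the interaction between the small batch size and gradient accumulation concrete, here is a rough arithmetic sketch. It reuses the hyperparameter values listed above; `num_train_examples` is a hypothetical placeholder (the real number of tokenized training features is not stated here), and the exact rounding used by a given training framework may differ.

```python
import math

# Values taken from the hyperparameters listed above.
batch_size = 4
gradient_accumulation_steps = 8
warmup_ratio = 0.1
n_epochs = 1

# Hypothetical number of training features, for illustration only.
num_train_examples = 64_000

# Gradients from 8 mini-batches of 4 are accumulated before each optimizer step.
effective_batch_size = batch_size * gradient_accumulation_steps

# Approximate number of optimizer steps and how many of them are warmup steps.
steps_per_epoch = math.ceil(num_train_examples / effective_batch_size)
total_steps = steps_per_epoch * n_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(effective_batch_size, steps_per_epoch, warmup_steps)
```

So although only 4 examples fit on the GPU at once, each parameter update is based on 32 examples.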
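The model was evaluated on the PQuAD Persian test set using the official PQuAD link. I also trained for more than one epoch, but the results got worse. Our XLM-Roberta outperforms our ParsBert on PQuAD, but the former is more than 3 times the size of the latter, so comparing the two is not fair.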

| Metric | Our XLM-Roberta Large | Our ParsBert |
| --- | --- | --- |
| Exact Match | 66.56* | 47.44 |
| F1 | 87.31* | 81.96 |

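
For readers unfamiliar with the two metrics in the table, the sketch below shows how Exact Match and token-level F1 are commonly computed for a single prediction in SQuAD-style evaluation. It is an assumption that PQuAD's official script follows the same general scheme; it may apply extra text normalization that this whitespace-based version omits. The function names `exact_match` and `token_f1` are illustrative.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the whitespace-split tokens are identical, else 0.0."""
    return float(prediction.split() == reference.split())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # If either answer is empty, F1 is 1 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts tokens shared by both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A prediction that contains the reference plus one extra word.
em = exact_match("26 years", "26")
f1 = token_f1("26 years", "26")
print(em, round(f1, 4))
```

A partially correct span like this scores zero on Exact Match but still earns partial F1 credit, which is one reason F1 is typically well above Exact Match.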

    from transformers import AutoTokenizer, AutoModelForQuestionAnswering

    path = 'pedramyazdipoor/persian_xlm_roberta_large'
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForQuestionAnswering.from_pretrained(path)


    import numpy as np

    def generate_indexes(start_logits, end_logits, N, min_index):
        # Pair every token index with its start / end score.
        start_probs = start_logits.tolist()
        end_probs = end_logits.tolist()
        list_start = dict(zip(np.arange(len(start_probs)), start_probs))
        list_end = dict(zip(np.arange(len(end_probs)), end_probs))

        # Descending sort by score.
        sorted_start_list = sorted(list_start.items(), key=lambda x: x[1], reverse=True)
        sorted_end_list = sorted(list_end.items(), key=lambda x: x[1], reverse=True)

        # Default to the first token; its combined score is the baseline to beat.
        start_idx, end_idx, prob = 0, 0, start_probs[0] + end_probs[0]

        # Search the top-N start and top-N end candidates for the best valid span.
        for a in range(N):
            for b in range(N):
                if (sorted_start_list[a][1] + sorted_end_list[b][1]) > prob:
                    # Valid span: start is not after end, and start lies beyond min_index.
                    if (sorted_start_list[a][0] <= sorted_end_list[b][0]) and (sorted_start_list[a][0] > min_index):
                        prob = sorted_start_list[a][1] + sorted_end_list[b][1]
                        start_idx = sorted_start_list[a][0]
                        end_idx = sorted_end_list[b][0]

        return start_idx, end_idx


    import torch

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model.eval().to(device)

    text = 'سلام من پدرامم 26 سالمه'
    question = 'چند سالمه؟'

    encoding = tokenizer(question, text, add_special_tokens=True,
                         return_token_type_ids=True,
                         return_tensors='pt',
                         padding=True,
                         return_offsets_mapping=True,
                         truncation='only_first',
                         max_length=32)

    # Inference only, so gradient tracking is disabled.
    with torch.no_grad():
        out = model(encoding['input_ids'].to(device),
                    encoding['attention_mask'].to(device),
                    encoding['token_type_ids'].to(device))

    # We had to change some pieces of code to make it compatible with generating one answer at a time.
    # If you have unanswerable questions, use out['start_logits'][0][0:] and out['end_logits'][0][0:],
    # because <s> (the first token) represents that case and must be compared with the other tokens.
    # You can set min_index in generate_indexes() to force the chosen tokens to lie within the context
    # (the start index must be greater than the index of the separator token).
    answer_start_index, answer_end_index = generate_indexes(out['start_logits'][0][1:], out['end_logits'][0][1:], 5, 0)

    # Note: the predicted indices refer to positions in the encoded (question, text) input after
    # dropping <s>, whereas the prints below slice the tokens of text + question.
    print(tokenizer.tokenize(text + question))
    print(tokenizer.tokenize(text + question)[answer_start_index : (answer_end_index + 1)])
    >>> ['▁سلام', '▁من', '▁پدر', 'ام', 'م', '▁26', '▁سالم', 'ه', 'نام', 'م', '▁چیست', '؟']
    >>> ['▁26']

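
The remark about `min_index` in the comments above can be made concrete with a small sketch. It assumes the encoded pair has the usual XLM-RoBERTa layout `<s> question </s></s> context </s>` and that the logits were sliced with `[1:]`, as in the call to `generate_indexes()`. The helper name `context_min_index` and the toy token ids are hypothetical; with a real tokenizer the separator id would come from `tokenizer.sep_token_id`.

```python
def context_min_index(input_ids, sep_token_id, dropped_prefix=1):
    """Index of the last separator before the context, in sliced-logit coordinates.

    dropped_prefix is the number of leading tokens removed from the logits
    (1 when they are sliced with [1:]).
    """
    first_sep = input_ids.index(sep_token_id)
    last_sep = first_sep
    # Skip over consecutive separators (e.g. "</s></s>" between question and context).
    while last_sep + 1 < len(input_ids) and input_ids[last_sep + 1] == sep_token_id:
        last_sep += 1
    return last_sep - dropped_prefix


# Toy ids for illustration only: 0 = <s>, 2 = </s>, everything else = ordinary tokens.
toy_input_ids = [0, 11, 12, 2, 2, 21, 22, 23, 2]
min_index = context_min_index(toy_input_ids, sep_token_id=2)
print(min_index)  # 3
```

Because `generate_indexes()` only accepts start positions strictly greater than `min_index`, passing this value as its fourth argument would restrict the answer start to context tokens. With the real encoding, one would presumably call it on `encoding['input_ids'][0].tolist()`.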
We hereby express our gratitude to Newsha Shahbodaghkhan for facilitating the gathering of the dataset.