The French DPR model uses CamemBERT as its base model and is then fine-tuned on a combination of three French Q&A datasets.
We use a combination of three French Q&A datasets: PIAF, FQuAD v1.0, and SQuAD-FR.
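We used 90,562 random questions for training and 22,391 for development. No question in the training set appears in the development set. For each question, we have a single positive_context (the paragraph that contains the answer to the question) and around 30 hard_negative_contexts (candidate passages that do not contain the answer). Hard negative contexts are obtained by querying an ES instance (via BM25 retrieval) and taking the top-k candidates that do not contain the answer.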
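The hard-negative selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to build the dataset: the function names, the case-insensitive substring test, and the toy passages are all assumptions, and the record keys (`positive_ctxs`, `hard_negative_ctxs`) are assumed to follow the JSON layout expected by the Facebook DPR training code.

```python
from typing import Dict, List


def select_hard_negatives(ranked_passages: List[str], answer: str, k: int = 30) -> List[str]:
    """Return up to k passages, in ranked order, that do not contain the answer."""
    answer_lc = answer.lower()
    negatives: List[str] = []
    for passage in ranked_passages:
        if len(negatives) >= k:
            break
        # Assumption: "does not contain the answer" is a case-insensitive substring test.
        if answer_lc not in passage.lower():
            negatives.append(passage)
    return negatives


def build_example(question: str, answer: str, positive_context: str,
                  ranked_passages: List[str], k: int = 30) -> Dict:
    """Assemble one training record with one positive and up to k hard negatives."""
    hard_negatives = select_hard_negatives(ranked_passages, answer, k)
    return {
        "question": question,
        "answers": [answer],
        "positive_ctxs": [{"title": "", "text": positive_context}],
        "hard_negative_ctxs": [{"title": "", "text": p} for p in hard_negatives],
    }


# Toy data; in practice ranked_passages would be the BM25 results for the question.
ranked = [
    "Paris est la capitale de la France.",
    "Lyon est une grande ville française.",
    "La France est un pays d'Europe.",
]
example = build_example(
    question="Quelle est la capitale de la France ?",
    answer="Paris",
    positive_context=ranked[0],
    ranked_passages=ranked,
    k=2,
)
print([c["text"] for c in example["hard_negative_ctxs"]])
```

Because the positive passage contains the answer, the same filter also keeps it out of the hard negatives.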
The files are available here.
We use the FQuADv1.0 and French-SQuAD evaluation sets.
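We use the official Facebook DPR implementation for training, with a slight modification: by default, the code works with RoBERTa models, but we changed a single line to make it easier to use with CamemBERT. This modification can be found over here.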
python -m torch.distributed.launch --nproc_per_node=8 train_dense_encoder.py \
    --max_grad_norm 2.0 \
    --encoder_model_type fairseq_roberta \
    --pretrained_file data/camembert-base \
    --seed 12345 \
    --sequence_length 256 \
    --warmup_steps 1237 \
    --batch_size 16 \
    --do_lower_case \
    --train_file ./data/DPR_FR_train.json \
    --dev_file ./data/DPR_FR_dev.json \
    --output_dir ./output/ \
    --learning_rate 2e-05 \
    --num_train_epochs 35 \
    --dev_batch_size 16 \
    --val_av_rank_start_epoch 30 \
    --pretrained_model_cfg ./data/camembert-base/
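As a rough sanity check on these hyperparameters, the arithmetic below relates the training-set size to the number of optimizer steps. It assumes, without confirmation from the source, that `--batch_size` is applied per GPU, so that each step sees 8 × 16 questions; under a different batching scheme the figures would change.

```python
# Values taken from the text and the training command.
num_train_questions = 90_562
num_gpus = 8          # --nproc_per_node
batch_size = 16       # --batch_size
num_epochs = 35       # --num_train_epochs
warmup_steps = 1237   # --warmup_steps

# Assumption: --batch_size is per GPU, so one step covers num_gpus * batch_size questions.
effective_batch = num_gpus * batch_size
steps_per_epoch = num_train_questions // effective_batch
total_steps = steps_per_epoch * num_epochs
warmup_fraction = warmup_steps / total_steps

print(f"effective batch size: {effective_batch}")
print(f"steps per epoch:      {steps_per_epoch}")
print(f"total steps:          {total_steps}")
print(f"warmup fraction:      {warmup_fraction:.3f}")
```

Under this assumption, the 1237 warmup steps cover roughly 5% of the training steps.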
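We obtain the following evaluation results on the FQuAD and SQuAD-FR evaluation (or validation) sets. To obtain these results, we use haystack's evaluation script (only retrieval results are reported).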
FQuAD v1.0 Evaluation
For 2764 out of 3184 questions (86.81%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.87
Retriever Mean Avg Precision: 0.57

SQuAD-FR Evaluation
For 8945 out of 10018 questions (89.29%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.89
Retriever Mean Avg Precision: 0.63
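For readers unfamiliar with the two metrics, here is a small sketch of how retriever recall and mean average precision can be computed from ranked results. It uses the textbook definitions; haystack's evaluation script may differ in details, and the helper names and toy relevance lists are illustrative assumptions.

```python
from typing import List


def recall_at_k(relevance: List[List[bool]]) -> float:
    """Fraction of questions with at least one relevant passage among those retrieved."""
    hits = sum(1 for ranked in relevance if any(ranked))
    return hits / len(relevance)


def average_precision(ranked: List[bool]) -> float:
    """Mean of precision@i over the ranks i that hold a relevant passage."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(ranked, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(relevance: List[List[bool]]) -> float:
    """Average of the per-question average precision."""
    return sum(average_precision(r) for r in relevance) / len(relevance)


# Toy example: one list per question, True where the passage contains the answer.
relevance = [
    [True, False, False],   # answer found at rank 1
    [False, True, False],   # answer found at rank 2
    [False, False, False],  # answer not retrieved
]
print(recall_at_k(relevance))
print(mean_average_precision(relevance))

# Recall recomputed from the counts reported above for DPR.
print(round(2764 / 3184, 4), round(8945 / 10018, 4))
```

With a single relevant passage per question, average precision reduces to the reciprocal of the rank at which that passage is retrieved.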
For reference, the BM25 results are shown below. As in the original paper, on SQuAD-like datasets the DPR results are consistently outperformed by BM25.
FQuAD v1.0 Evaluation
For 2966 out of 3184 questions (93.15%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.93
Retriever Mean Avg Precision: 0.74

SQuAD-FR Evaluation
For 9353 out of 10018 questions (93.36%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.93
Retriever Mean Avg Precision: 0.77
The results reported here were obtained with the haystack library. To get similar embeddings using only the HF transformers library, you can do the following:
from transformers import AutoTokenizer, AutoModel

query = "Salut, mon chien est-il mignon ?"

# This loads the context (passage) encoder. To embed questions, the matching
# checkpoint is "etalab-ia/dpr-question_encoder-fr_qa-camembert".
tokenizer = AutoTokenizer.from_pretrained("etalab-ia/dpr-ctx_encoder-fr_qa-camembert", do_lower_case=True)
input_ids = tokenizer(query, return_tensors='pt')["input_ids"]
model = AutoModel.from_pretrained("etalab-ia/dpr-ctx_encoder-fr_qa-camembert", return_dict=True)

# The pooled output is used as the dense embedding of the input text.
embeddings = model(input_ids).pooler_output
print(embeddings)
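Once question and passage embeddings are available, DPR scores each passage by the inner product between the two vectors and keeps the highest-scoring candidates. The sketch below shows only that ranking step in plain Python. The three-dimensional vectors are made-up toy values standing in for real `pooler_output` rows, and the function names are illustrative.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_passages(question_vec: Sequence[float],
                  passage_vecs: List[Sequence[float]],
                  top_k: int = 20) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs sorted by decreasing inner product."""
    scored = [(i, dot(question_vec, vec)) for i, vec in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (assumed values, not produced by the model).
question_vec = [0.2, 0.9, -0.1]
passage_vecs = [
    [0.1, 0.8, 0.0],
    [-0.5, 0.1, 0.3],
    [0.3, 0.2, 0.9],
]

for index, score in rank_passages(question_vec, passage_vecs, top_k=2):
    print(index, round(score, 2))
```

In a real setup, the question vector would come from the question encoder and the passage vectors from the context encoder, with a document store handling the nearest-neighbour search.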
With haystack, we use it as a retriever:
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="etalab-ia/dpr-question_encoder-fr_qa-camembert",
    passage_embedding_model="etalab-ia/dpr-ctx_encoder-fr_qa-camembert",
    model_version=dpr_model_tag,
    infer_tokenizer_classes=True,
)
This work was performed using HPC resources from GENCI–IDRIS (Grant 2020-AD011011224).
PIAF
@inproceedings{KeraronLBAMSSS20,
  author    = {Rachel Keraron and Guillaume Lancrenon and Mathilde Bras and Fr{\'{e}}d{\'{e}}ric Allary and Gilles Moyse and Thomas Scialom and Edmundo{-}Pavel Soriano{-}Morales and Jacopo Staiano},
  title     = {Project {PIAF:} Building a Native French Question-Answering Dataset},
  booktitle = {{LREC}},
  pages     = {5481--5490},
  publisher = {European Language Resources Association},
  year      = {2020}
}

FQuAD
@article{dHoffschmidt2020FQuADFQ,
  title   = {FQuAD: French Question Answering Dataset},
  author  = {Martin d'Hoffschmidt and Maxime Vidal and Wacim Belblidia and Tom Brendl{\'e} and Quentin Heinrich},
  journal = {ArXiv},
  year    = {2020},
  volume  = {abs/2002.06071}
}

SQuAD-FR
@MISC{kabbadj2018,
  author = "Kabbadj, Ali",
  title  = "Something new in French Text Mining and Information Extraction (Universal Chatbot): Largest Q&A French training dataset (110 000+) ",
  editor = "linkedin.com",
  month  = "November",
  year   = "2018",
  url    = "\url{https://www.linkedin.com/pulse/something-new-french-text-mining-information-chatbot-largest-kabbadj/}",
  note   = "[Online; posted 11-November-2018]",
}

CamemBERT
HF model card: https://huggingface.co/camembert-base
@inproceedings{martin2020camembert,
  title     = {CamemBERT: a Tasty French Language Model},
  author    = {Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year      = {2020}
}

DPR
@misc{karpukhin2020dense,
  title         = {Dense Passage Retrieval for Open-Domain Question Answering},
  author        = {Vladimir Karpukhin and Barlas Oğuz and Sewon Min and Patrick Lewis and Ledell Wu and Sergey Edunov and Danqi Chen and Wen-tau Yih},
  year          = {2020},
  eprint        = {2004.04906},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}