Model:

etalab-ia/dpr-question_encoder-fr_qa-camembert

Description

A French DPR model based on CamemBERT, fine-tuned on a combination of three French question-answering datasets.

Data

French Q&A

We use a combination of three French Q&A datasets:

  • PIAFv1.1
  • FQuADv1.0
  • SQuAD-FR (SQuAD automatically translated to French)

Training

We used 90,562 randomly sampled questions for training and 22,391 questions for development. No question in the training set appears in the development set. For each question, we have one positive context (the paragraph that contains the answer to that question) and around 30 hard negative contexts (contexts that do not contain the answer). Hard negative contexts are found by querying an Elasticsearch instance (via BM25 retrieval) and keeping the top-k candidates that do not contain the answer.
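
For reference, here is a minimal sketch of what one training record looks like in the Facebook DPR JSON format (the field names follow the upstream DPR repository; the question and passages below are made-up illustrations, not actual entries of the training file):

    # Illustrative sketch of one record in the Facebook DPR training JSON format.
    # Field names follow the upstream DPR repo; the values are invented examples.
    example = {
        "question": "Quand la tour Eiffel a-t-elle été construite ?",
        "answers": ["1889"],
        "positive_ctxs": [
            {"title": "Tour Eiffel", "text": "Achevée en 1889, la tour Eiffel ..."},
        ],
        "negative_ctxs": [],
        "hard_negative_ctxs": [
            # ~30 BM25-retrieved passages that do NOT contain the answer
            {"title": "Tour Montparnasse", "text": "La tour Montparnasse est ..."},
        ],
    }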

The files can be found here.

Evaluation

We use the FQuADv1.0 and French-SQuAD evaluation sets.

Training Script

We used the official Facebook DPR implementation training script with a slight modification: by default, the code works with RoBERTa models, and we changed a single line to make it easier to use with CamemBERT. The modification can be found over here.

Hyperparameters

    python -m torch.distributed.launch --nproc_per_node=8 train_dense_encoder.py \
    --max_grad_norm 2.0 --encoder_model_type hf_bert --pretrained_file data/bert-base-multilingual-uncased \
    --seed 12345 --sequence_length 256 --warmup_steps 1237 --batch_size 16 --do_lower_case \
    --train_file DPR_FR_train.json \
    --dev_file  ./data/100_hard_neg_ctxs/DPR_FR_dev.json \
    --output_dir ./output/bert --learning_rate 2e-05 --num_train_epochs 35 \
    --dev_batch_size 16 --val_av_rank_start_epoch 25 \
    --pretrained_model_cfg ./data/bert-base-multilingual-uncased
    

Evaluation results

We obtain the following evaluation results on the FQuAD and SQuAD-FR evaluation (or validation) sets. To get these results, we use haystack's evaluation script (we report retrieval results only).
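
As a rough illustration of what the numbers below measure, here is a simplified sketch of top-k recall and mean average precision over (question, answer) pairs. It is not haystack's evaluation script, and the exact metric definitions haystack uses may differ; the `retrieve` function is a placeholder assumption.

    # Sketch: top-k recall and a simplified mean average precision.
    # `retrieve(question, top_k)` is a placeholder returning a ranked list of passage texts.
    def evaluate_retriever(questions, answers, retrieve, top_k=20):
        hits, average_precisions = 0, []
        for question, answer in zip(questions, answers):
            passages = retrieve(question, top_k)
            # Treat any retrieved passage containing the answer string as relevant.
            relevant_ranks = [rank for rank, passage in enumerate(passages, start=1) if answer in passage]
            if relevant_ranks:
                hits += 1
                precisions = [(i + 1) / rank for i, rank in enumerate(relevant_ranks)]
                average_precisions.append(sum(precisions) / len(relevant_ranks))
            else:
                average_precisions.append(0.0)
        recall = hits / len(questions)                      # questions with the answer in the top-k
        mean_ap = sum(average_precisions) / len(questions)  # simplified mean average precision
        return recall, mean_ap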

DPR

    FQuAD v1.0 Evaluation
    For 2764 out of 3184 questions (86.81%), the answer was in the top-20 candidate passages selected by the retriever.
    Retriever Recall: 0.87
    Retriever Mean Avg Precision: 0.57
    
    SQuAD-FR Evaluation
    For 8945 out of 10018 questions (89.29%), the answer was in the top-20 candidate passages selected by the retriever.
    Retriever Recall: 0.89
    Retriever Mean Avg Precision: 0.63
    

BM25

For reference, the BM25 results are shown below. As in the original paper, on SQuAD-like datasets the DPR results are consistently outperformed by BM25.

    FQuAD v1.0 Evaluation
    For 2966 out of 3184 questions (93.15%), the answer was in the top-20 candidate passages selected by the retriever.
    Retriever Recall: 0.93
    Retriever Mean Avg Precision: 0.74
    
    SQuAD-FR Evaluation
    For 9353 out of 10018 questions (93.36%), the answer was in the top-20 candidate passages selected by the retriever.
    Retriever Recall: 0.93
    Retriever Mean Avg Precision: 0.77
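
For reference, a minimal sketch of how such a BM25 baseline can be run with haystack, assuming an Elasticsearch-backed document store that already contains the passages; import paths and class names vary across haystack versions (the ones below are from haystack 1.x):

    # Sketch of a BM25 baseline with haystack 1.x (imports differ in other versions).
    from haystack.document_stores import ElasticsearchDocumentStore
    from haystack.nodes import BM25Retriever

    document_store = ElasticsearchDocumentStore(host="localhost", index="documents")
    bm25_retriever = BM25Retriever(document_store=document_store)

    # Top-20 candidate passages for a query, as in the evaluation above.
    candidates = bm25_retriever.retrieve(query="Quand la tour Eiffel a-t-elle été construite ?", top_k=20)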
    

Usage

The results reported here were obtained with the haystack library. To obtain similar embeddings using only the HF transformers library, you can do the following:

    from transformers import AutoTokenizer, AutoModel

    query = "Salut, mon chien est-il mignon ?"

    # Load the question encoder and its tokenizer from the Hugging Face Hub.
    tokenizer = AutoTokenizer.from_pretrained("etalab-ia/dpr-question_encoder-fr_qa-camembert", do_lower_case=True)
    model = AutoModel.from_pretrained("etalab-ia/dpr-question_encoder-fr_qa-camembert", return_dict=True)

    # Encode the query; the pooled output serves as the DPR query embedding here.
    input_ids = tokenizer(query, return_tensors="pt")["input_ids"]
    embeddings = model(input_ids).pooler_output
    print(embeddings)
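
These query embeddings are meant to be compared, by dot product, with passage embeddings from the companion context encoder. A minimal sketch, reusing `embeddings` from the snippet above and assuming the same loading pattern for etalab-ia/dpr-ctx_encoder-fr_qa-camembert:

    # Sketch: score a passage against the query with the companion context encoder.
    import torch
    from transformers import AutoTokenizer, AutoModel

    ctx_tokenizer = AutoTokenizer.from_pretrained("etalab-ia/dpr-ctx_encoder-fr_qa-camembert", do_lower_case=True)
    ctx_model = AutoModel.from_pretrained("etalab-ia/dpr-ctx_encoder-fr_qa-camembert", return_dict=True)

    passage = "Les chiens sont des animaux de compagnie très affectueux."
    ctx_ids = ctx_tokenizer(passage, return_tensors="pt")["input_ids"]
    ctx_embeddings = ctx_model(ctx_ids).pooler_output

    # DPR ranks passages by the dot product between query and passage embeddings.
    score = torch.matmul(embeddings, ctx_embeddings.T)
    print(score)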
    

Using haystack as a retriever:

    # Assumes `document_store` and `dpr_model_tag` are already defined; the import path for
    # DensePassageRetriever depends on the haystack version
    # (e.g. `from haystack.nodes import DensePassageRetriever` in haystack 1.x).
    retriever = DensePassageRetriever(
        document_store=document_store,
        query_embedding_model="etalab-ia/dpr-question_encoder-fr_qa-camembert",
        passage_embedding_model="etalab-ia/dpr-ctx_encoder-fr_qa-camembert",
        model_version=dpr_model_tag,
        infer_tokenizer_classes=True,
    )
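
As a hedged usage sketch (not part of the original card): once documents have been written to the store, passage embeddings have to be computed with the retriever before querying. Method names below are from haystack 1.x and may differ in other versions:

    # Assumed haystack 1.x workflow: index passage embeddings, then retrieve.
    document_store.update_embeddings(retriever)
    top_passages = retriever.retrieve(query="Salut, mon chien est-il mignon ?", top_k=20)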
    

Acknowledgments

This work was carried out using HPC resources from GENCI–IDRIS (Grant 2020-AD011011224).

Citations

Datasets

PIAF

    @inproceedings{KeraronLBAMSSS20,
      author    = {Rachel Keraron and
                   Guillaume Lancrenon and
                   Mathilde Bras and
                   Fr{\'{e}}d{\'{e}}ric Allary and
                   Gilles Moyse and
                   Thomas Scialom and
                   Edmundo{-}Pavel Soriano{-}Morales and
                   Jacopo Staiano},
      title     = {Project {PIAF:} Building a Native French Question-Answering Dataset},
      booktitle = {{LREC}},
      pages     = {5481--5490},
      publisher = {European Language Resources Association},
      year      = {2020}
    }
    
FQuAD

    @article{dHoffschmidt2020FQuADFQ,
      title={FQuAD: French Question Answering Dataset},
  author={Martin d'Hoffschmidt and Maxime Vidal and Wacim Belblidia and Tom Brendl{\'e} and Quentin Heinrich},
      journal={ArXiv},
      year={2020},
      volume={abs/2002.06071}
    }
    
SQuAD-FR

     @MISC{kabbadj2018,
       author =       "Kabbadj, Ali",
       title =        "Something new in French Text Mining and Information Extraction (Universal Chatbot): Largest Q&A French training dataset (110 000+) ",
       editor =       "linkedin.com",
       month =        "November",
       year =         "2018",
       url =          "\url{https://www.linkedin.com/pulse/something-new-french-text-mining-information-chatbot-largest-kabbadj/}",
       note =         "[Online; posted 11-November-2018]",
     }
    

Models

CamemBERT

HF model card: https://huggingface.co/camembert-base

    @inproceedings{martin2020camembert,
      title={CamemBERT: a Tasty French Language Model},
      author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
      booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
      year={2020}
    }
    
DPR

    @misc{karpukhin2020dense,
        title={Dense Passage Retrieval for Open-Domain Question Answering},
        author={Vladimir Karpukhin and Barlas Oğuz and Sewon Min and Patrick Lewis and Ledell Wu and Sergey Edunov and Danqi Chen and Wen-tau Yih},
        year={2020},
        eprint={2004.04906},
        archivePrefix={arXiv},
        primaryClass={cs.CL}
    }