French DPR model based on CamemBERT, fine-tuned on a combination of three French Q&A datasets.
We used a combination of three French Q&A datasets:
1. PIAF
2. FQuADv1.0
3. SQuAD-FR (French-SQuAD)
We used 90,562 randomly chosen questions for training and 22,391 for development. Questions in the training set do not appear in the development set. For each question, there is a single positive_context (a passage containing the answer to the question) and around 30 hard_negative_contexts (candidate passages that do not contain the answer). The hard negative contexts are obtained by querying an Elasticsearch instance (via BM25 retrieval) and keeping the top-k candidates that do not contain the answer.
The data files are available here.
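As a rough illustration of the hard-negative mining step described above, here is a minimal sketch (not the actual script) that queries a local Elasticsearch instance with BM25 and keeps the top candidates that do not contain the answer; the index name, field name and client setup are assumptions:

from elasticsearch import Elasticsearch

def mine_hard_negatives(question: str, answer: str, k: int = 30):
    """Return up to k BM25 candidates that do NOT contain the answer string."""
    es = Elasticsearch("http://localhost:9200")  # assumed local ES instance
    # BM25 is Elasticsearch's default similarity for a plain match query.
    resp = es.search(
        index="french_wiki_passages",   # hypothetical index name
        body={"query": {"match": {"text": question}}},
        size=100,                       # over-fetch, then filter
    )
    negatives = []
    for hit in resp["hits"]["hits"]:
        passage = hit["_source"]["text"]
        if answer.lower() not in passage.lower():  # drop passages containing the answer
            negatives.append(passage)
        if len(negatives) == k:
            break
    return negatives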
We use the FQuADv1.0 and French-SQuAD evaluation sets.
We used the official Facebook DPR implementation for training, with a slight modification: by default the code works with RoBERTa models, so we changed one line to make it easier to use with CamemBERT. The modification can be found over here.
python -m torch.distributed.launch --nproc_per_node=8 train_dense_encoder.py \
    --max_grad_norm 2.0 \
    --encoder_model_type fairseq_roberta \
    --pretrained_file data/camembert-base \
    --seed 12345 \
    --sequence_length 256 \
    --warmup_steps 1237 \
    --batch_size 16 \
    --do_lower_case \
    --train_file ./data/DPR_FR_train.json \
    --dev_file ./data/DPR_FR_dev.json \
    --output_dir ./output/ \
    --learning_rate 2e-05 \
    --num_train_epochs 35 \
    --dev_batch_size 16 \
    --val_av_rank_start_epoch 30 \
    --pretrained_model_cfg ./data/camembert-base/
We obtain the following evaluation results on the FQuAD and SQuAD-FR evaluation (validation) sets. To obtain these results, we used haystack's evaluation script (only retrieval results are reported).
FQuAD v1.0 Evaluation
For 2764 out of 3184 questions (86.81%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.87
Retriever Mean Avg Precision: 0.57

SQuAD-FR Evaluation
For 8945 out of 10018 questions (89.29%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.89
Retriever Mean Avg Precision: 0.63
For reference, the BM25 results are shown below. As in the original paper, on SQuAD-like datasets DPR is consistently outperformed by BM25.
FQuAD v1.0 Evaluation
For 2966 out of 3184 questions (93.15%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.93
Retriever Mean Avg Precision: 0.74

SQuAD-FR Evaluation
For 9353 out of 10018 questions (93.36%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.93
Retriever Mean Avg Precision: 0.77
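For reference, the two reported metrics can be computed as follows; this is a simplified, self-contained sketch (not haystack's evaluation script), where a passage counts as relevant if it contains the gold answer string and retrieve is a placeholder for any retriever:

from typing import Callable, List

def evaluate_retriever(questions: List[str],
                       answers: List[str],
                       retrieve: Callable[[str, int], List[str]],
                       top_k: int = 20):
    """Compute top-k retriever recall and mean average precision (simplified)."""
    hits, avg_precisions = 0, []
    for question, answer in zip(questions, answers):
        passages = retrieve(question, top_k)                      # ranked candidates
        relevant = [answer.lower() in p.lower() for p in passages]
        if any(relevant):
            hits += 1
        # Average precision: mean of precision@i at the ranks i of relevant passages.
        precisions = [
            sum(relevant[: i + 1]) / (i + 1)
            for i, is_rel in enumerate(relevant) if is_rel
        ]
        avg_precisions.append(sum(precisions) / len(precisions) if precisions else 0.0)
    recall = hits / len(questions)
    mean_ap = sum(avg_precisions) / len(avg_precisions)
    return recall, mean_ap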
The results reported here were obtained with the haystack library. To obtain similar embeddings using only the HF transformers library, you can do the following:
from transformers import AutoTokenizer, AutoModel

query = "Salut, mon chien est-il mignon ?"

# Load the tokenizer and the context encoder from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("etalab-ia/dpr-ctx_encoder-fr_qa-camembert", do_lower_case=True)
input_ids = tokenizer(query, return_tensors='pt')["input_ids"]
model = AutoModel.from_pretrained("etalab-ia/dpr-ctx_encoder-fr_qa-camembert", return_dict=True)

# The DPR embedding is the pooled [CLS] representation.
embeddings = model(input_ids).pooler_output
print(embeddings)
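For illustration, the two encoders can be combined outside haystack by ranking passages with the dot product of their pooled embeddings, which is how DPR scores candidates. The snippet below is a sketch only, reusing the question-encoder checkpoint referenced in the haystack example further down; the example passages are made up:

import torch
from transformers import AutoTokenizer, AutoModel

question = "Salut, mon chien est-il mignon ?"
passages = [
    "Les chiens sont souvent considérés comme mignons.",
    "La tour Eiffel se trouve à Paris.",
]

# Question encoder (same family as the context encoder above).
q_tok = AutoTokenizer.from_pretrained("etalab-ia/dpr-question_encoder-fr_qa-camembert", do_lower_case=True)
q_enc = AutoModel.from_pretrained("etalab-ia/dpr-question_encoder-fr_qa-camembert", return_dict=True)

# Context (passage) encoder, as above.
c_tok = AutoTokenizer.from_pretrained("etalab-ia/dpr-ctx_encoder-fr_qa-camembert", do_lower_case=True)
c_enc = AutoModel.from_pretrained("etalab-ia/dpr-ctx_encoder-fr_qa-camembert", return_dict=True)

with torch.no_grad():
    q_emb = q_enc(**q_tok(question, return_tensors="pt")).pooler_output                                  # (1, hidden)
    p_emb = c_enc(**c_tok(passages, return_tensors="pt", padding=True, truncation=True)).pooler_output   # (2, hidden)

scores = q_emb @ p_emb.T   # DPR ranks passages by dot-product similarity
print(scores)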
And with haystack, we use it as a retriever:
# document_store and dpr_model_tag are assumed to be defined beforehand
# (e.g. a document store instance and the model revision to pin).
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="etalab-ia/dpr-question_encoder-fr_qa-camembert",
    passage_embedding_model="etalab-ia/dpr-ctx_encoder-fr_qa-camembert",
    model_version=dpr_model_tag,
    infer_tokenizer_classes=True,
)
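As a hedged end-to-end sketch (assuming a haystack 1.x install where the imports below exist and documents carry a content field; the toy documents are made up and the model_version pin is omitted), usage might look like:

from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import DensePassageRetriever

# Toy in-memory store; any haystack document store is used the same way.
document_store = InMemoryDocumentStore()
document_store.write_documents([
    {"content": "Les chiens sont souvent considérés comme mignons."},
    {"content": "La tour Eiffel se trouve à Paris."},
])

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="etalab-ia/dpr-question_encoder-fr_qa-camembert",
    passage_embedding_model="etalab-ia/dpr-ctx_encoder-fr_qa-camembert",
    infer_tokenizer_classes=True,
)

# Pre-compute passage embeddings, then retrieve the top candidates for a question.
document_store.update_embeddings(retriever)
print(retriever.retrieve(query="Mon chien est-il mignon ?", top_k=2))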
This work was carried out using HPC resources from GENCI–IDRIS (Grant 2020-AD011011224).
PIAF
@inproceedings{KeraronLBAMSSS20,
author = {Rachel Keraron and
Guillaume Lancrenon and
Mathilde Bras and
Fr{\'{e}}d{\'{e}}ric Allary and
Gilles Moyse and
Thomas Scialom and
Edmundo{-}Pavel Soriano{-}Morales and
Jacopo Staiano},
title = {Project {PIAF:} Building a Native French Question-Answering Dataset},
booktitle = {{LREC}},
pages = {5481--5490},
publisher = {European Language Resources Association},
year = {2020}
}
FQuAD
@article{dHoffschmidt2020FQuADFQ,
title={FQuAD: French Question Answering Dataset},
author={Martin d'Hoffschmidt and Maxime Vidal and Wacim Belblidia and Tom Brendl{\'e} and Quentin Heinrich},
journal={ArXiv},
year={2020},
volume={abs/2002.06071}
}
SQuAD-FR
@MISC{kabbadj2018,
author = "Kabbadj, Ali",
title = "Something new in French Text Mining and Information Extraction (Universal Chatbot): Largest Q&A French training dataset (110 000+) ",
editor = "linkedin.com",
month = "November",
year = "2018",
url = "\url{https://www.linkedin.com/pulse/something-new-french-text-mining-information-chatbot-largest-kabbadj/}",
note = "[Online; posted 11-November-2018]",
}
CamemBERT
HF model card: https://huggingface.co/camembert-base
@inproceedings{martin2020camembert,
title={CamemBERT: a Tasty French Language Model},
author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
year={2020}
}
DPR
@misc{karpukhin2020dense,
title={Dense Passage Retrieval for Open-Domain Question Answering},
author={Vladimir Karpukhin and Barlas Oğuz and Sewon Min and Patrick Lewis and Ledell Wu and Sergey Edunov and Danqi Chen and Wen-tau Yih},
year={2020},
eprint={2004.04906},
archivePrefix={arXiv},
primaryClass={cs.CL}
}