camembert-base-squadFR-fquad-piaf

Description

A French question-answering model, fine-tuned from the base CamemBERT model on a combination of three French Q&A datasets:

  • PIAFv1.1
  • FQuADv1.0
  • SQuAD-FR (SQuAD automatically translated to French)

Training hyperparameters

    python run_squad.py \
    --model_type camembert \
    --model_name_or_path camembert-base \
    --do_train --do_eval \
    --train_file data/SQuAD+fquad+piaf.json \
    --predict_file data/fquad_valid.json \
    --per_gpu_train_batch_size 12 \
    --learning_rate 3e-5 \
    --num_train_epochs 4 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --save_steps 10000
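
Contexts longer than `--max_seq_length` are split into overlapping windows, and `--doc_stride` sets how far each window advances. The sketch below illustrates that interaction; it follows the windowing loop of the original BERT SQuAD script and is not taken from this model's training code. The function name `doc_spans`, the question length, and the count of special tokens are assumptions chosen for illustration.

```python
def doc_spans(n_doc_tokens, max_tokens_for_doc, doc_stride):
    """Return (start, length) windows covering a tokenized context."""
    spans = []
    start = 0
    while start < n_doc_tokens:
        length = min(n_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == n_doc_tokens:
            break  # the last window reaches the end of the context
        start += min(length, doc_stride)
    return spans


# Hypothetical example: a 20-token question and a 900-token context.
max_seq_length, doc_stride = 384, 128
n_query_tokens, n_special_tokens = 20, 4  # assumed values, for illustration only
budget = max_seq_length - n_query_tokens - n_special_tokens  # context tokens per window
print(doc_spans(900, budget, doc_stride))
```

With a stride smaller than the window budget, consecutive windows overlap, so an answer cut off at the edge of one window is likely to appear whole in a neighbouring one.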
    

Evaluation results

FQuAD v1.0 evaluation

    {"f1": 79.81, "exact_match": 55.14}
    

SQuAD-FR evaluation

    {"f1": 80.61, "exact_match": 59.54}
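
The two reported metrics are SQuAD-style scores: `exact_match` checks whether the normalized prediction equals a reference answer, and `f1` measures token overlap between the two. The sketch below is a simplified illustration, not the official evaluation script. It only lowercases, strips ASCII punctuation, and collapses whitespace, whereas the official scripts may normalize further (for example by removing articles). The function names are assumptions.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Un peintre français.", "un peintre français"))
print(f1_score("peintre français", "un peintre français"))
```

A dataset-level score is typically the average of these per-question values, expressed as a percentage, taking the best score when a question has several reference answers.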
    

Usage

    from transformers import pipeline
    
    nlp = pipeline('question-answering', model='etalab-ia/camembert-base-squadFR-fquad-piaf', tokenizer='etalab-ia/camembert-base-squadFR-fquad-piaf')
    
    nlp({
        'question': "Qui est Claude Monet?",
        'context': "Claude Monet, né le 14 novembre 1840 à Paris et mort le 5 décembre 1926 à Giverny, est un peintre français et l’un des fondateurs de l'impressionnisme."
    })
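
For a single question, the question-answering pipeline returns a dictionary with the keys `score`, `start`, `end`, and `answer`, where `start` and `end` are character offsets into the context. The helper below is a minimal sketch of how such a result might be consumed. The name `describe_answer`, the `min_score` threshold, and the sample dictionary are all hypothetical; the sample is hand-written for illustration and is not output from this model.

```python
def describe_answer(result, context, min_score=0.0):
    """Summarize a QA pipeline result, or return None below the score threshold."""
    if result["score"] < min_score:
        return None
    # Recover the answer span from the character offsets.
    span = context[result["start"]:result["end"]]
    return {"answer": result["answer"], "span": span, "score": round(result["score"], 3)}


context = "Claude Monet est un peintre français."
answer = "un peintre français"
start = context.find(answer)
# Hand-written placeholder, NOT real model output.
sample = {"score": 0.5, "start": start, "end": start + len(answer), "answer": answer}

print(describe_answer(sample, context))
print(describe_answer(sample, context, min_score=0.9))
```

Filtering on `score` is one simple way to discard low-confidence answers; a suitable threshold depends on the application.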
    

Acknowledgments

This work was performed using HPC resources from GENCI–IDRIS (Grant 2020-AD011011224).

Citations

PIAF

    @inproceedings{KeraronLBAMSSS20,
      author    = {Rachel Keraron and
                   Guillaume Lancrenon and
                   Mathilde Bras and
                   Fr{\'{e}}d{\'{e}}ric Allary and
                   Gilles Moyse and
                   Thomas Scialom and
                   Edmundo{-}Pavel Soriano{-}Morales and
                   Jacopo Staiano},
      title     = {Project {PIAF:} Building a Native French Question-Answering Dataset},
      booktitle = {{LREC}},
      pages     = {5481--5490},
      publisher = {European Language Resources Association},
      year      = {2020}
    }
    

FQuAD

    @article{dHoffschmidt2020FQuADFQ,
      title={FQuAD: French Question Answering Dataset},
      author={Martin d'Hoffschmidt and Maxime Vidal and Wacim Belblidia and Tom Brendl{\'e} and Quentin Heinrich},
      journal={ArXiv},
      year={2020},
      volume={abs/2002.06071}
    }
    

SQuAD-FR

     @MISC{kabbadj2018,
       author =       "Kabbadj, Ali",
       title =        "Something new in French Text Mining and Information Extraction (Universal Chatbot): Largest Q&A French training dataset (110 000+) ",
       editor =       "linkedin.com",
       month =        "November",
       year =         "2018",
       url =          "\url{https://www.linkedin.com/pulse/something-new-french-text-mining-information-chatbot-largest-kabbadj/}",
       note =         "[Online; posted 11-November-2018]",
     }
    

CamemBERT

HF model card: https://huggingface.co/camembert-base

    @inproceedings{martin2020camembert,
      title={CamemBERT: a Tasty French Language Model},
      author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
      booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
      year={2020}
    }