DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical Domains

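In recent years, pre-trained language models (PLMs) have achieved the best performance on a wide range of natural language processing (NLP) tasks. While the first models were trained on general-domain data, specialized models have since emerged to handle domain-specific tasks more effectively. In this paper, we propose an original study of PLMs in the medical domain for the French language. We compare, for the first time, the performance of PLMs trained on public data and on private data from healthcare establishments. We also evaluate different learning strategies on a set of biomedical tasks. Finally, we release the first specialized PLMs for the French biomedical domain, called DrBERT, as well as the largest corpus of medical data under a free license on which these models are trained.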

CAS: French Corpus with Clinical Cases

            Train   Dev     Test
Documents   5,306   1,137   1,137
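
As a quick sanity check on the split sizes above, the following minimal sketch computes the total number of documents and the share of each split. Only the three counts come from the table; the variable names are illustrative.

```python
# Document counts per split, copied from the table above.
splits = {"Train": 5306, "Dev": 1137, "Test": 1137}

# Total number of documents across the three splits.
total = sum(splits.values())

# Fraction of the corpus that falls into each split.
shares = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name}: {count} documents ({shares[name]:.1%})")
print(f"Total: {total} documents")
```

The counts add up to 7,580 documents, which corresponds to a 70/15/15 split.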

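The ESSAIS (Dalloux et al., 2021) and CAS (Grabar et al., 2018) corpora contain 13,848 and 7,580 clinical cases in French, respectively. Some clinical cases are accompanied by discussions. A subset of the full set of cases is enriched with morpho-syntactic annotations (part-of-speech (POS) tagging, lemmatization) and semantic annotations (UMLS concepts, negation, and uncertainty). In our case, we focus only on the POS tagging task.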

Model Metric

              precision    recall  f1-score   support

         ABR     0.8683    0.8480    0.8580       171
         ADJ     0.9634    0.9751    0.9692      4018
         ADV     0.9935    0.9849    0.9892       926
     DET:ART     0.9982    0.9997    0.9989      3308
     DET:POS     1.0000    1.0000    1.0000       133
         INT     1.0000    0.7000    0.8235        10
         KON     0.9883    0.9976    0.9929       845
         NAM     0.9144    0.9353    0.9247       834
         NOM     0.9827    0.9803    0.9815      7980
         NUM     0.9825    0.9845    0.9835      1422
     PRO:DEM     0.9924    1.0000    0.9962       131
     PRO:IND     0.9630    1.0000    0.9811        78
     PRO:PER     0.9948    0.9931    0.9939       579
     PRO:REL     1.0000    0.9908    0.9954       109
         PRP     0.9989    0.9982    0.9985      3785
     PRP:det     1.0000    0.9985    0.9993       681
         PUN     0.9996    0.9958    0.9977      2376
     PUN:cit     0.9756    0.9524    0.9639        84
        SENT     1.0000    0.9974    0.9987      1174
         SYM     0.9495    1.0000    0.9741        94
    VER:cond     1.0000    1.0000    1.0000        11
    VER:futu     1.0000    0.9444    0.9714        18
    VER:impf     1.0000    0.9963    0.9981       804
    VER:infi     1.0000    0.9585    0.9788       193
    VER:pper     0.9742    0.9564    0.9652      1261
    VER:ppre     0.9617    0.9901    0.9757       203
    VER:pres     0.9833    0.9904    0.9868       830
    VER:simp     0.9123    0.7761    0.8387        67
    VER:subi     1.0000    0.7000    0.8235        10
    VER:subp     1.0000    0.8333    0.9091        18

    accuracy                         0.9842     32153
   macro avg     0.9799    0.9492    0.9623     32153
weighted avg     0.9843    0.9842    0.9842     32153
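
The two summary rows differ in how they combine the per-class scores: the macro average gives every tag equal weight, while the weighted average weights each tag by its support. The sketch below recomputes both from the per-class rows of the report above. The helper names (`parse_rows`, `macro_average`, `weighted_average`) are illustrative and not part of any library, and because the inputs are already rounded to four decimals the results can only be expected to match the reported values approximately.

```python
# Per-class rows copied from the report above:
# label, precision, recall, f1-score, support.
REPORT_ROWS = """
ABR 0.8683 0.8480 0.8580 171
ADJ 0.9634 0.9751 0.9692 4018
ADV 0.9935 0.9849 0.9892 926
DET:ART 0.9982 0.9997 0.9989 3308
DET:POS 1.0000 1.0000 1.0000 133
INT 1.0000 0.7000 0.8235 10
KON 0.9883 0.9976 0.9929 845
NAM 0.9144 0.9353 0.9247 834
NOM 0.9827 0.9803 0.9815 7980
NUM 0.9825 0.9845 0.9835 1422
PRO:DEM 0.9924 1.0000 0.9962 131
PRO:IND 0.9630 1.0000 0.9811 78
PRO:PER 0.9948 0.9931 0.9939 579
PRO:REL 1.0000 0.9908 0.9954 109
PRP 0.9989 0.9982 0.9985 3785
PRP:det 1.0000 0.9985 0.9993 681
PUN 0.9996 0.9958 0.9977 2376
PUN:cit 0.9756 0.9524 0.9639 84
SENT 1.0000 0.9974 0.9987 1174
SYM 0.9495 1.0000 0.9741 94
VER:cond 1.0000 1.0000 1.0000 11
VER:futu 1.0000 0.9444 0.9714 18
VER:impf 1.0000 0.9963 0.9981 804
VER:infi 1.0000 0.9585 0.9788 193
VER:pper 0.9742 0.9564 0.9652 1261
VER:ppre 0.9617 0.9901 0.9757 203
VER:pres 0.9833 0.9904 0.9868 830
VER:simp 0.9123 0.7761 0.8387 67
VER:subi 1.0000 0.7000 0.8235 10
VER:subp 1.0000 0.8333 0.9091 18
"""


def parse_rows(text):
    """Turn the whitespace-separated rows into typed tuples."""
    rows = []
    for line in text.strip().splitlines():
        label, precision, recall, f1, support = line.split()
        rows.append((label, float(precision), float(recall), float(f1), int(support)))
    return rows


def macro_average(values):
    """Unweighted mean: every class counts equally."""
    return sum(values) / len(values)


def weighted_average(values, weights):
    """Mean weighted by class support."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


rows = parse_rows(REPORT_ROWS)
recalls = [row[2] for row in rows]
f1_scores = [row[3] for row in rows]
supports = [row[4] for row in rows]

total_support = sum(supports)
macro_recall = macro_average(recalls)
macro_f1 = macro_average(f1_scores)
weighted_f1 = weighted_average(f1_scores, supports)

print(f"classes:      {len(rows)}")
print(f"support:      {total_support}")
print(f"macro recall: {macro_recall:.4f}")
print(f"macro F1:     {macro_f1:.4f}")
print(f"weighted F1:  {weighted_f1:.4f}")
```

The gap between the macro and weighted averages comes mostly from a few rare tags with low recall (INT, VER:subi, VER:simp), which pull the macro average down but barely affect the support-weighted one.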

Citation BibTeX

@inproceedings{labrak2023drbert,
    title = {{DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains}},
    author = {Labrak, Yanis and Bazoge, Adrien and Dufour, Richard and Rouvier, Mickael and Morin, Emmanuel and Daille, Béatrice and Gourraud, Pierre-Antoine},
    booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL'23), Long Paper},
    month = jul,
    year = 2023,
    address = {Toronto, Canada},
    publisher = {Association for Computational Linguistics}
}

Author:

DrBERT

Dataset size:

422.92 MB