Model:

PlanTL-GOB-ES/roberta-base-biomedical-clinical-es

Task:

Fill-Mask

Libraries:

PyTorch Transformers

Language:

Other:

roberta biomedical clinical spanish AutoTrain Compatible

Preprints:

arxiv:2109.03570 arxiv:2109.07765

License:

apache-2.0

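Biomedical-clinical language model for Spanish

Model description

A biomedical pretrained language model for Spanish. It is a RoBERTa-based model trained on a Spanish biomedical-clinical corpus collected from several sources.

Intended uses and limitations

Out of the box, the model supports only masked language modeling, that is, the Fill Mask task (try the Inference API or read the next section). It is, however, intended to be fine-tuned on downstream tasks such as named entity recognition or text classification.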

How to use

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Model ID as listed at the top of this card.
model_id = "PlanTL-GOB-ES/roberta-base-biomedical-clinical-es"
# Load the tokenizer and the masked-language-model weights explicitly.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
# Wrap both in a fill-mask pipeline and predict the masked token.
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
unmasker("El único antecedente personal a reseñar era la <mask> arterial.")

# Output
[
  {
    "sequence": " El único antecedente personal a reseñar era la hipertensión arterial.",
    "score": 0.9855039715766907,
    "token": 3529,
    "token_str": " hipertensión"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la diabetes arterial.",
    "score": 0.0039140828885138035,
    "token": 1945,
    "token_str": " diabetes"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la hipotensión arterial.",
    "score": 0.002484665485098958,
    "token": 11483,
    "token_str": " hipotensión"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la Hipertensión arterial.",
    "score": 0.0023484621196985245,
    "token": 12238,
    "token_str": " Hipertensión"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la presión arterial.",
    "score": 0.0008009297889657319,
    "token": 2267,
    "token_str": " presión"
  }
]
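
The list above contains both "hipertensión" and "Hipertensión" as separate candidates. As a minimal sketch of one possible post-processing step (not part of the model or the pipeline), the snippet below merges candidates that differ only in case or leading whitespace by summing their scores. The scores are copied from the output above and rounded to six decimals; `merge_case_variants` is a hypothetical helper name.

```python
from collections import defaultdict

# Predictions copied from the output above (scores rounded to 6 decimals).
predictions = [
    {"token_str": " hipertensión", "score": 0.985504},
    {"token_str": " diabetes", "score": 0.003914},
    {"token_str": " hipotensión", "score": 0.002485},
    {"token_str": " Hipertensión", "score": 0.002348},
    {"token_str": " presión", "score": 0.000801},
]


def merge_case_variants(preds):
    """Sum the scores of candidates that differ only in case or padding."""
    merged = defaultdict(float)
    for pred in preds:
        key = pred["token_str"].strip().lower()
        merged[key] += pred["score"]
    # Highest combined score first.
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


ranked = merge_case_variants(predictions)
best_word, best_score = ranked[0]
print(best_word, round(best_score, 4))
```

Whether merging is appropriate depends on the application: for case-sensitive tasks the two variants should stay separate.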

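Limitations and bias

At the time of submission, no measures have been taken to estimate the bias embedded in the model. However, we are well aware that our models may be biased, since the corpora were collected by crawling multiple web sources. We intend to conduct research in these areas in the future and, if completed, this model card will be updated.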

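Training

The training corpus was tokenized with a byte-level version of Byte-Pair Encoding (BPE), with a vocabulary of 52,000 tokens. Pretraining used masked language modeling, following the approach of the RoBERTa base model and the same hyperparameters as the original work. Training lasted 48 hours in total on 16 NVIDIA V100 GPUs with 16 GB of memory each, using the Adam optimizer with a peak learning rate of 0.0005 and an effective batch size of 2,048 sentences.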
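
To make the batch-size figures concrete, here is a small arithmetic sketch. The GPU count, effective batch size, and training time come from the paragraph above. The per-GPU micro-batch size is not stated in the source, so `MICRO_BATCH_PER_GPU` is a purely hypothetical value chosen for illustration, as is the assumption that gradient accumulation was used at all.

```python
# Figures quoted in the training description above.
NUM_GPUS = 16
EFFECTIVE_BATCH = 2048  # sentences per optimizer step
TRAINING_HOURS = 48

# Hypothetical value: the per-GPU micro-batch size is not given in the text.
MICRO_BATCH_PER_GPU = 16


def accumulation_steps(effective_batch, num_gpus, micro_batch):
    """Gradient-accumulation steps needed to reach the effective batch size."""
    per_step = num_gpus * micro_batch
    if effective_batch % per_step != 0:
        raise ValueError("effective batch must be a multiple of num_gpus * micro_batch")
    return effective_batch // per_step


per_gpu_share = EFFECTIVE_BATCH // NUM_GPUS  # sentences each GPU handles per update
steps = accumulation_steps(EFFECTIVE_BATCH, NUM_GPUS, MICRO_BATCH_PER_GPU)
gpu_hours = NUM_GPUS * TRAINING_HOURS

print(per_gpu_share, steps, gpu_hours)
```

With these numbers each GPU contributes 128 sentences per optimizer step, and the whole run amounts to 768 GPU-hours.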

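The training corpus is composed of several Spanish biomedical corpora, gathered from publicly available corpora and crawlers, plus a real-world clinical corpus built from more than 278K clinical documents and notes. To obtain a high-quality training corpus while retaining the idiosyncrasies of clinical language, a cleaning pipeline was applied only to the biomedical corpora; the clinical corpus was left uncleaned. The cleaning operations were essentially:

data parsing in different formats
sentence splitting
language detection
filtering of ill-formed sentences
deduplication of repetitive contents
keeping the original document boundaries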
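
The steps listed above can be sketched roughly as follows. This is an illustrative toy, not the authors' pipeline: the source does not describe the tools, heuristics, or thresholds used, so every helper name, the function-word list standing in for language detection, and the well-formedness rule are assumptions. Data parsing is omitted, and the input is assumed to be plain text already.

```python
import re

# Hypothetical helpers and thresholds; the source does not specify them.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
SPANISH_HINTS = {"el", "la", "de", "que", "en", "los", "las", "del", "con", "por", "un", "una"}


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def looks_spanish(sentence):
    """Toy stand-in for language detection based on common function words."""
    words = re.findall(r"\w+", sentence.lower())
    return any(word in SPANISH_HINTS for word in words)


def well_formed(sentence, min_words=3):
    """Reject very short fragments and sentences that are mostly non-letters."""
    letters = sum(ch.isalpha() for ch in sentence)
    return len(sentence.split()) >= min_words and letters / len(sentence) > 0.5


def clean_corpus(documents):
    """Clean each document, drop repeated sentences, keep document boundaries."""
    seen = set()
    cleaned = []
    for doc in documents:
        kept = []
        for sentence in split_sentences(doc):
            if not (looks_spanish(sentence) and well_formed(sentence)):
                continue
            if sentence in seen:  # deduplicate across the whole corpus
                continue
            seen.add(sentence)
            kept.append(sentence)
        if kept:
            cleaned.append(kept)  # one list per source document
    return cleaned


docs = [
    "El paciente presenta fiebre alta. 12345 ### 678. El paciente presenta fiebre alta.",
    "The patient was discharged today. La paciente refiere dolor en el pecho.",
]
result = clean_corpus(docs)
print(result)
```

Returning one list of sentences per input document is what preserves the document boundaries; a real pipeline would replace the toy language check with a proper language-identification model.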

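The biomedical corpora were then concatenated and deduplicated globally. Finally, the clinical corpus was concatenated with the cleaned biomedical corpora, yielding a medium-sized Spanish biomedical-clinical corpus of more than 1B tokens. The table below shows basic statistics for the individual cleaned corpora: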

Name	No. tokens	Description
Medical crawler	745,705,946	Crawler of more than 3,000 URLs belonging to Spanish biomedical and health domains.
Clinical cases misc.	102,855,267	A miscellany of medical content, essentially clinical cases. Note that a clinical case report is a scientific publication where medical practitioners share patient cases and it is different from a clinical note or document.
Clinical notes/documents	91,250,080	Collection of more than 278K clinical documents, including discharge reports, clinical course notes and X-ray reports, for a total of 91M tokens.
Scielo	60,007,289	Publications written in Spanish crawled from the Spanish SciELO server in 2017.
BARR2_background	24,516,442	Biomedical Abbreviation Recognition and Resolution (BARR2) containing Spanish clinical case study sections from a variety of clinical disciplines.
Wikipedia_life_sciences	13,890,501	Wikipedia articles crawled 04/01/2021 with the Wikipedia API python library starting from the "Ciencias_de_la_vida" category up to a maximum of 5 subcategories. Multiple links to the same articles are then discarded to avoid repeating content.
Patents	13,463,387	Google Patent in Medical Domain for Spain (Spanish). The accepted codes (Medical Domain) for Json files of patents are: "A61B", "A61C","A61F", "A61H", "A61K", "A61L","A61M", "A61B", "A61P".
EMEA	5,377,448	Spanish-side documents extracted from parallel corpora made out of PDF documents from the European Medicines Agency.
mespen_Medline	4,166,077	Spanish-side articles extracted from a Spanish-English parallel corpus of biomedical scientific literature. The parallel resources are aggregated from the MedlinePlus source.
PubMed	1,858,966	Open-access articles from the PubMed repository crawled in 2017.
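
As a quick sanity check on the "more than 1B tokens" figure, the sketch below sums the "No. tokens" column of the table above. The counts are copied from the table; the labels are paraphrased from the Description column and are only there for readability.

```python
# Token counts copied from the "No. tokens" column, in table order.
# Labels are paraphrased from the descriptions, not official corpus names.
token_counts = [
    ("biomedical and health web crawl", 745_705_946),
    ("clinical cases misc.", 102_855_267),
    ("clinical notes/documents", 91_250_080),
    ("SciELO publications", 60_007_289),
    ("BARR2 clinical case sections", 24_516_442),
    ("Wikipedia life sciences", 13_890_501),
    ("patents", 13_463_387),
    ("European Medicines Agency documents", 5_377_448),
    ("MedlinePlus parallel corpus", 4_166_077),
    ("PubMed", 1_858_966),
]

total = sum(count for _, count in token_counts)

# Share contributed by the (uncleaned) clinical notes and documents.
clinical = dict(token_counts)["clinical notes/documents"]
clinical_share = clinical / total

print(f"total tokens: {total:,}")
print(f"clinical share: {clinical_share:.1%}")
```

The total comes to 1,063,091,403 tokens, consistent with the corpus size stated in the text, and the clinical notes account for roughly 8.6% of it.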

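Evaluation

The model has been evaluated on named entity recognition (NER) using the following datasets:

PharmaCoNER: a track on chemical and drug mention recognition in Spanish medical texts (for more information, see https://temu.bsc.es/pharmaconer/).
CANTEMIST: a shared task focused specifically on named entity recognition of tumor morphology in Spanish (for more information, see https://zenodo.org/record/3978041#.YTt5qH2xXbQ).
ICTUSnet: 1,006 hospital discharge reports of patients admitted for stroke, from 18 different Spanish hospitals. It contains more than 79,000 annotations for 51 different kinds of variables.

The evaluation results are compared against the mBERT and BETO models: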

F1 - Precision - Recall	roberta-base-biomedical-clinical-es	mBERT	BETO
PharmaCoNER	90.04 - 88.92 - 91.18	87.46 - 86.50 - 88.46	88.18 - 87.12 - 89.28
CANTEMIST	83.34 - 81.48 - 85.30	82.61 - 81.12 - 84.15	82.42 - 80.91 - 84.00
ICTUSnet	88.08 - 84.92 - 91.50	86.75 - 83.53 - 90.23	85.95 - 83.10 - 89.02
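
Each cell in the table lists F1, precision, and recall in that order. Assuming the F1 column is the usual harmonic mean of precision and recall (the source does not say how it was aggregated), the values for roberta-base-biomedical-clinical-es can be cross-checked with a few lines; small differences in the last digit are expected because the published precision and recall are themselves rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (reported F1, precision, recall) for roberta-base-biomedical-clinical-es,
# copied from the table above.
reported = {
    "PharmaCoNER": (90.04, 88.92, 91.18),
    "CANTEMIST": (83.34, 81.48, 85.30),
    "ICTUSnet": (88.08, 84.92, 91.50),
}

differences = {}
for name, (f1, precision, recall) in reported.items():
    recomputed = f1_score(precision, recall)
    differences[name] = abs(recomputed - f1)
    print(f"{name}: reported {f1:.2f}, recomputed {recomputed:.3f}")
```

For all three datasets the recomputed value agrees with the reported F1 to within 0.01.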

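Additional information

Authors

Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (BSC) (bsc-temu@bsc.es)

Contact information

For further information, send an email to plantl-gob-es@bsc.es

Copyright

Licensing information

Apache License, Version 2.0

Funding

This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.

Citation information

If you use our models, please cite our latest preprint: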

@misc{carrino2021biomedical,
      title={Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario}, 
      author={Casimiro Pio Carrino and Jordi Armengol-Estapé and Asier Gutiérrez-Fandiño and Joan Llop-Palao and Marc Pàmies and Aitor Gonzalez-Agirre and Marta Villegas},
      year={2021},
      eprint={2109.03570},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

If you use our medical crawler corpus, please cite this preprint:

@misc{carrino2021spanish,
      title={Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models}, 
      author={Casimiro Pio Carrino and Jordi Armengol-Estapé and Ona de Gibert Bonet and Asier Gutiérrez-Fandiño and Aitor Gonzalez-Agirre and Martin Krallinger and Marta Villegas},
      year={2021},
      eprint={2109.07765},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

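Disclaimer

The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using any of these models (or systems based on these models), or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of artificial intelligence.

In no event shall the owner of the models (SEDIA, State Secretariat for Digitalization and Artificial Intelligence) or the creator (BSC, Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties of these models.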

Author:

Plan de Tecnologías del Lenguaje - Gobierno de España

Dataset size:

483.23 MB