Biomedical pretrained language model for Spanish. This is a RoBERTa-based model trained on a Spanish biomedical-clinical corpus collected from several sources.
    from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

    # Load the tokenizer and the masked-language model from the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")
    model = AutoModelForMaskedLM.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")

    # Build a fill-mask pipeline and predict the masked token
    unmasker = pipeline("fill-mask", model="BSC-TeMU/roberta-base-biomedical-es")
    unmasker("El único antecedente personal a reseñar era la <mask> arterial.")
    # Output
    [
      {
        "sequence": " El único antecedente personal a reseñar era la hipertensión arterial.",
        "score": 0.9855039715766907,
        "token": 3529,
        "token_str": " hipertensión"
      },
      {
        "sequence": " El único antecedente personal a reseñar era la diabetes arterial.",
        "score": 0.0039140828885138035,
        "token": 1945,
        "token_str": " diabetes"
      },
      {
        "sequence": " El único antecedente personal a reseñar era la hipotensión arterial.",
        "score": 0.002484665485098958,
        "token": 11483,
        "token_str": " hipotensión"
      },
      {
        "sequence": " El único antecedente personal a reseñar era la Hipertensión arterial.",
        "score": 0.0023484621196985245,
        "token": 12238,
        "token_str": " Hipertensión"
      },
      {
        "sequence": " El único antecedente personal a reseñar era la presión arterial.",
        "score": 0.0008009297889657319,
        "token": 2267,
        "token_str": " presión"
      }
    ]
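The pipeline returns a list of dictionaries, one per candidate token. As a minimal sketch of how such a result might be post-processed, the snippet below picks the highest-scoring candidate and keeps only the candidates above a confidence threshold. The `predictions` list copies three entries from the output above with rounded scores; the helper name `confident_fills` and the 0.003 threshold are illustrative assumptions, not part of the model or the library.

```python
# Three of the predictions from the fill-mask output above, with scores rounded.
predictions = [
    {"score": 0.9855, "token": 3529, "token_str": " hipertensión"},
    {"score": 0.0039, "token": 1945, "token_str": " diabetes"},
    {"score": 0.0025, "token": 11483, "token_str": " hipotensión"},
]


def confident_fills(results, min_score=0.003):
    """Return stripped token strings whose score is at least min_score.

    Hypothetical helper for illustration; the threshold is an arbitrary choice.
    """
    return [r["token_str"].strip() for r in results if r["score"] >= min_score]


# Highest-scoring candidate, with the leading space of the token removed.
best = max(predictions, key=lambda r: r["score"])
print(best["token_str"].strip())
print(confident_fills(predictions))
```

Stripping `token_str` matters because the tokens in the output carry a leading space.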
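The training corpus was tokenized with a byte-level version of Byte-Pair Encoding (BPE), with a vocabulary of 52,000 tokens. Pretraining consisted of masked language model training, following the approach used for the RoBERTa base model with the same hyperparameters as the original work. Training took 48 hours in total on 16 NVIDIA V100 GPUs with 16GB DDRAM each, using the Adam optimizer with a peak learning rate of 0.0005 and an effective batch size of 2,048 sentences.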
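To make the training figures concrete, the sketch below derives the per-GPU share of the effective batch and illustrates what a "peak" learning rate means in a warmup-then-decay schedule. Only the GPU count, the effective batch size and the peak learning rate come from the text. The even split across GPUs, the linear shape of the schedule, and the `warmup_steps` and `total_steps` values are assumptions made for illustration; the source does not state them.

```python
NUM_GPUS = 16            # from the text
EFFECTIVE_BATCH = 2048   # sentences per optimizer step, from the text
PEAK_LR = 0.0005         # from the text

# Assumption: the effective batch is split evenly across GPUs. Any gradient
# accumulation would make the per-forward-pass share smaller still.
per_gpu_batch = EFFECTIVE_BATCH // NUM_GPUS


def lr_at(step, warmup_steps=10_000, total_steps=100_000, peak=PEAK_LR):
    """Linear warmup to `peak`, then linear decay to zero.

    warmup_steps and total_steps are hypothetical placeholders.
    """
    if step < warmup_steps:
        return peak * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak * remaining / (total_steps - warmup_steps)


print(per_gpu_batch)
print(lr_at(5_000), lr_at(10_000), lr_at(100_000))
```

The learning rate reaches 0.0005 only at the end of warmup; every other step trains at a lower rate under this assumed schedule.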
Name | No. tokens | Description |
1237321 | 745,705,946 | Crawler of more than 3,000 URLs belonging to Spanish biomedical and health domains. |
Clinical cases misc. | 102,855,267 | A miscellany of medical content, essentially clinical cases. Note that a clinical case report is a scientific publication where medical practitioners share patient cases and it is different from a clinical note or document. |
Clinical notes/documents | 91,250,080 | Collection of more than 278K clinical documents, including discharge reports, clinical course notes and X-ray reports, for a total of 91M tokens. |
1238321 | 60,007,289 | Publications written in Spanish crawled from the Spanish SciELO server in 2017. |
1239321 | 24,516,442 | Biomedical Abbreviation Recognition and Resolution (BARR2) containing Spanish clinical case study sections from a variety of clinical disciplines. |
Wikipedia_life_sciences | 13,890,501 | Wikipedia articles crawled 04/01/2021 with the 12310321 starting from the "Ciencias_de_la_vida" category up to a maximum of 5 subcategories. Multiple links to the same articles are then discarded to avoid repeating content. |
Patents | 13,463,387 | Google Patent in Medical Domain for Spain (Spanish). The accepted codes (Medical Domain) for Json files of patents are: "A61B", "A61C","A61F", "A61H", "A61K", "A61L","A61M", "A61B", "A61P". |
12311321 | 5,377,448 | Spanish-side documents extracted from parallel corpora made out of PDF documents from the European Medicines Agency. |
12312321 | 4,166,077 | Spanish-side articles extracted from a collection of Spanish-English parallel corpus consisting of biomedical scientific literature. The collection of parallel resources are aggregated from the MedlinePlus source. |
PubMed | 1,858,966 | Open-access articles from the PubMed repository crawled in 2017. |
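As a quick way to see how the corpus is distributed across sources, the sketch below sums the token counts from the table and prints each source's share. The counts are copied from the table in row order; the comments are shortened from the table's names and descriptions.

```python
# Token counts copied from the table above, in row order.
token_counts = [
    745_705_946,  # crawler of Spanish biomedical and health web domains
    102_855_267,  # Clinical cases misc.
    91_250_080,   # Clinical notes/documents
    60_007_289,   # publications from the Spanish SciELO server
    24_516_442,   # BARR2 clinical case study sections
    13_890_501,   # Wikipedia_life_sciences
    13_463_387,   # Patents
    5_377_448,    # European Medicines Agency parallel corpora (Spanish side)
    4_166_077,    # Spanish-English biomedical literature (Spanish side)
    1_858_966,    # PubMed
]

total = sum(token_counts)
shares = [count / total for count in token_counts]

print(f"total tokens: {total:,}")
for count, share in zip(token_counts, shares):
    print(f"{count:>13,}  {share:6.2%}")
```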
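PharmaCoNER: a track on chemical and drug mention recognition in Spanish medical texts (for more information, see: https://temu.bsc.es/pharmaconer/ ).
CANTEMIST: a shared task focused specifically on named entity recognition of tumor morphology in Spanish (for more information, see: https://zenodo.org/record/3978041#.YTt5qH2xXbQ ).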
F1 - Precision - Recall | roberta-base-biomedical-clinical-es | mBERT | BETO |
PharmaCoNER | 90.04 - 88.92 - 91.18 | 87.46 - 86.50 - 88.46 | 88.18 - 87.12 - 89.28 |
CANTEMIST | 83.34 - 81.48 - 85.30 | 82.61 - 81.12 - 84.15 | 82.42 - 80.91 - 84.00 |
ICTUSnet | 88.08 - 84.92 - 91.50 | 86.75 - 83.53 - 90.23 | 85.95 - 83.10 - 89.02 |
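Each cell in the table packs three numbers in the order F1, precision, recall. The sketch below parses the cells of the first model column and recomputes F1 as the harmonic mean of precision and recall, as a sanity check on how to read them. Small gaps are expected because the published precision and recall are already rounded; that the reported F1 was computed exactly this way is an assumption.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Cells copied from the roberta-base-biomedical-clinical-es column above.
cells = {
    "PharmaCoNER": "90.04 - 88.92 - 91.18",
    "CANTEMIST": "83.34 - 81.48 - 85.30",
    "ICTUSnet": "88.08 - 84.92 - 91.50",
}

gaps = {}
for task, cell in cells.items():
    # Each cell reads "F1 - Precision - Recall".
    reported_f1, precision, recall = (float(x) for x in cell.split(" - "))
    recomputed = f1_from(precision, recall)
    gaps[task] = abs(recomputed - reported_f1)
    print(f"{task}: reported {reported_f1:.2f}, recomputed {recomputed:.2f}")
```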
    @misc{carrino2021biomedical,
      title={Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario},
      author={Casimiro Pio Carrino and Jordi Armengol-Estapé and Asier Gutiérrez-Fandiño and Joan Llop-Palao and Marc Pàmies and Aitor Gonzalez-Agirre and Marta Villegas},
      year={2021},
      eprint={2109.03570},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }

    @misc{carrino2021spanish,
      title={Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models},
      author={Casimiro Pio Carrino and Jordi Armengol-Estapé and Ona de Gibert Bonet and Asier Gutiérrez-Fandiño and Aitor Gonzalez-Agirre and Martin Krallinger and Marta Villegas},
      year={2021},
      eprint={2109.07765},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }
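In no event shall the owner of the models (SEDIA - State Secretariat for Digitalization and Artificial Intelligence) or the creator (BSC - Barcelona Supercomputing Center) be liable for any results arising from the use of these models by third parties.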