Model:

bionlp/bluebert_pubmed_uncased_L-12_H-768_A-12

English

BlueBert-Base, Uncased, PubMed

Model description

A BERT model pre-trained on PubMed abstracts

Intended uses & limitations

How to use

Please refer to https://github.com/ncbi-nlp/bluebert
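As a minimal sketch (assuming the Hugging Face `transformers` library is installed; the repository above remains the authoritative reference), the checkpoint can be loaded directly from the Hub:

```python
from transformers import AutoTokenizer, AutoModel

# Load the BlueBERT-Base (uncased, PubMed) checkpoint from the Hugging Face Hub
model_id = "bionlp/bluebert_pubmed_uncased_L-12_H-768_A-12"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode a sample biomedical sentence and extract contextual embeddings
inputs = tokenizer("aspirin inhibits platelet aggregation .", return_tensors="pt")
outputs = model(**inputs)
# outputs.last_hidden_state holds one 768-dimensional vector per token (H-768)
```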

Training data

We provide the preprocessed PubMed texts that were used to pre-train the BlueBERT models. The corpus contains ~4,000M words extracted from the PubMed ASCII code version.

Pre-trained model: https://huggingface.co/bert-base-uncased

Training procedure

The code snippet below shows the preprocessing details.

import re
from nltk.tokenize import TreebankWordTokenizer

# Lowercase the text and collapse newlines into single spaces
value = value.lower()
value = re.sub(r'[\r\n]+', ' ', value)
# Remove non-ASCII characters
value = re.sub(r'[^\x00-\x7F]+', ' ', value)

# Tokenize with the Penn Treebank tokenizer and re-join with single spaces
tokenized = TreebankWordTokenizer().tokenize(value)
sentence = ' '.join(tokenized)
# Re-attach the possessive 's that the tokenizer splits off
sentence = re.sub(r"\s's\b", "'s", sentence)
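Wrapped as a function (`clean` is an illustrative name, not part of the original release; assumes NLTK is installed), the same steps can be checked end-to-end:

```python
import re
from nltk.tokenize import TreebankWordTokenizer

def clean(value: str) -> str:
    # Same steps as above: lowercase, flatten newlines, strip non-ASCII,
    # Treebank-tokenize, then re-attach the possessive 's
    value = value.lower()
    value = re.sub(r'[\r\n]+', ' ', value)
    value = re.sub(r'[^\x00-\x7F]+', ' ', value)
    tokenized = TreebankWordTokenizer().tokenize(value)
    sentence = ' '.join(tokenized)
    sentence = re.sub(r"\s's\b", "'s", sentence)
    return sentence

print(clean("BlueBERT's\ncorpus"))
```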

BibTeX entry and citation info

@InProceedings{peng2019transfer,
  author    = {Yifan Peng and Shankai Yan and Zhiyong Lu},
  title     = {Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets},
  booktitle = {Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019)},
  year      = {2019},
  pages     = {58--65},
}