BlueBERT, a BERT model pre-trained on PubMed abstracts
Please see https://github.com/ncbi-nlp/bluebert for details.
We provide the preprocessed PubMed texts used to pre-train the BlueBERT models. The corpus contains approximately 4,000M words extracted from the PubMed ASCII code version.
Pre-trained model: https://huggingface.co/bert-base-uncased
The code snippet below shows the preprocessing steps in detail.
```python
import re
from nltk.tokenize import TreebankWordTokenizer

# lowercase the text
value = value.lower()
# collapse line breaks into spaces
value = re.sub(r'[\r\n]+', ' ', value)
# replace non-ASCII characters with spaces
value = re.sub(r'[^\x00-\x7F]+', ' ', value)
# tokenize with the Treebank tokenizer and rejoin with single spaces
tokenized = TreebankWordTokenizer().tokenize(value)
sentence = ' '.join(tokenized)
# reattach the possessive 's split off by the tokenizer
sentence = re.sub(r"\s's\b", "'s", sentence)
```
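The steps above can be wrapped into a reusable function. Below is a minimal, standard-library-only sketch: a simple regex tokenizer stands in for NLTK's TreebankWordTokenizer (an assumption on our part), so punctuation splitting will differ slightly from the original pipeline.

```python
import re


def preprocess(value: str) -> str:
    """Approximate the PubMed preprocessing pipeline described above.

    NOTE: a regex tokenizer replaces NLTK's TreebankWordTokenizer here,
    so this is only an approximation of the original tokenization.
    """
    value = value.lower()
    value = re.sub(r'[\r\n]+', ' ', value)        # collapse line breaks
    value = re.sub(r'[^\x00-\x7F]+', ' ', value)  # drop non-ASCII characters
    # crude stand-in tokenizer: words, possessive 's, or single punctuation marks
    tokenized = re.findall(r"\w+|'s|[^\w\s]", value)
    sentence = ' '.join(tokenized)
    # reattach the possessive 's split off during tokenization
    sentence = re.sub(r"\s's\b", "'s", sentence)
    return sentence


print(preprocess("BlueBERT's corpus,\ncontains PubMed abstracts."))
# bluebert's corpus , contains pubmed abstracts .
```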
@InProceedings{peng2019transfer,
  author    = {Yifan Peng and Shankai Yan and Zhiyong Lu},
  title     = {Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets},
  booktitle = {Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019)},
  year      = {2019},
  pages     = {58--65},
}