Model:
MilosKosRad/BioNER
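This model was created during a research collaboration between Bayer Pharma and the Serbian Institute for Artificial Intelligence Research and Development. It was trained on 26 biomedical named entity (NE) classes and can perform zero-shot inference. It can also be fine-tuned further for new classes with only a few examples (few-shot learning). For more details about our method, see the paper "A transformer-based method for zero and few-shot biomedical named entity recognition". This model corresponds to the PubMedBERT-based model, trained with 1 in the first segment (see the paper for more details).
The model takes two strings as input. String1 is the NE label to search for in the second string. String2 is the short text in which to search for the NE (denoted by String1). The model outputs a list over the elements (tokens) of String2, where one (1) marks a token of a found named entity and zero (0) marks any other, non-NE token.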
    from transformers import AutoTokenizer
    from transformers import BertForTokenClassification

    modelname = 'MilosKosRad/BioNER'  # model path
    tokenizer = AutoTokenizer.from_pretrained(modelname)  # load the tokenizer of the model

    string1 = 'Drug'  # NE label to search for
    string2 = 'No recent antibiotics or other nephrotoxins, and no symptoms of UTI with benign UA.'  # text to search in

    encodings = tokenizer(string1, string2, is_split_into_words=False,
                          padding=True, truncation=True, add_special_tokens=True,
                          return_offsets_mapping=False, max_length=512,
                          return_tensors='pt')

    model0 = BertForTokenClassification.from_pretrained(modelname, num_labels=2)
    prediction_logits = model0(**encodings)  # model output; the per-token logits are in .logits
    print(prediction_logits)
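The call above returns raw logits, not the list of zeros and ones described earlier. The sketch below shows one possible way to turn per-token logits into 0/1 labels for the tokens of String2 only. The helper name `extract_entity_tokens` and the toy values are illustrative assumptions, not part of the released code. It also assumes that class index 1 is the "entity" class and that segment id 1 marks String2, as is usual for BERT-style sentence pairs.

```python
from typing import List, Sequence, Tuple

SPECIAL_TOKENS = {'[CLS]', '[SEP]', '[PAD]'}


def extract_entity_tokens(
    logits: Sequence[Sequence[float]],
    tokens: Sequence[str],
    segment_ids: Sequence[int],
) -> List[Tuple[str, int]]:
    """Return (token, label) pairs for the tokens of the second string.

    The label is the argmax over the two class scores; 1 is assumed to
    mean "part of the requested named entity".
    """
    pairs = []
    for scores, token, segment in zip(logits, tokens, segment_ids):
        if segment != 1 or token in SPECIAL_TOKENS:
            continue  # skip the label string (segment 0) and special tokens
        label = max(range(len(scores)), key=lambda k: scores[k])  # argmax
        pairs.append((token, label))
    return pairs


# Toy values made up for illustration; they are NOT real model output.
toy_tokens = ['[CLS]', 'drug', '[SEP]', 'no', 'antibiotics', '[SEP]']
toy_segments = [0, 0, 0, 1, 1, 1]
toy_logits = [[2.0, -1.0], [2.0, -1.0], [2.0, -1.0],
              [1.5, -0.5], [-1.0, 2.0], [2.0, -1.0]]

pairs = extract_entity_tokens(toy_logits, toy_tokens, toy_segments)
print(pairs)

# With the real model, the inputs could be obtained like this:
#   logits = model0(**encodings).logits[0].detach().tolist()
#   tokens = tokenizer.convert_ids_to_tokens(encodings['input_ids'][0].tolist())
#   segments = encodings['token_type_ids'][0].tolist()
```

Taking the argmax is the simplest decision rule. A probability threshold on the softmax of the logits would be an alternative if precision and recall need to be traded off.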
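To fine-tune the model on a new entity using a few shots, the dataset needs to be converted to a torch.utils.data.Dataset containing the BERT tokens and a set of 0s and 1s (where 1 means the class is positive and the token should be predicted as a member of the given NE class). Once the dataset is created, the following can be done (for details, see the code on GitHub: https://github.com/br-ai-ns-institute/Zero-ShotNER ).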
    import os
    import time

    from transformers import BertForTokenClassification, Trainer, TrainingArguments

    # Assumed to be defined beforehand: the few-shot training sets (train1shot,
    # train10shot, train100shot), valid_dataset, class_unseen (name of the new
    # entity class), model_path (model to start from) and tokenizer.
    few_shot_sets = {'1shot': train1shot, '10shot': train10shot, '100shot': train100shot}

    for shots, train_dataset in few_shot_sets.items():
        training_args = TrainingArguments(
            output_dir='./Results' + class_unseen + 'FewShot' + shots,  # folder to store the results
            num_train_epochs=10,                # number of training epochs
            per_device_train_batch_size=16,     # batch size per device during training
            per_device_eval_batch_size=16,      # batch size for evaluation
            weight_decay=0.01,                  # strength of weight decay
            logging_dir='./Logs' + class_unseen + 'FewShot' + shots,  # folder to store the logs
            save_strategy='epoch',
            evaluation_strategy='epoch',        # named eval_strategy in newer transformers releases
            load_best_model_at_end=True
        )

        model0 = BertForTokenClassification.from_pretrained(model_path, num_labels=2)
        trainer = Trainer(
            model=model0,                 # pre-trained model for fine-tuning
            args=training_args,           # training arguments defined above
            train_dataset=train_dataset,  # dataset class object for training
            eval_dataset=valid_dataset    # dataset class object for validation
        )

        start_time = time.time()
        trainer.train()
        total_time = time.time() - start_time

        # Save to a separate path so that model_path keeps pointing at the starting model.
        save_path = os.path.join('Results', class_unseen, 'FewShot', shots, 'Model')
        os.makedirs(save_path, exist_ok=True)
        model0.save_pretrained(save_path)

        tokenizer_path = os.path.join('Results', class_unseen, 'FewShot', shots, 'Tokenizer')
        os.makedirs(tokenizer_path, exist_ok=True)
        tokenizer.save_pretrained(tokenizer_path)
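The fine-tuning dataset described above pairs each BERT token with a 0 or a 1. The sketch below shows one possible way to expand word-level 0/1 labels for String2 to token level. The helper `align_labels` is hypothetical and is not taken from the authors' repository. Masking special tokens and the label string with -100 (the default ignore index of PyTorch's cross-entropy loss) is an assumption made here; the original code may treat these positions differently.

```python
from typing import List, Optional, Sequence

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels(
    word_labels: Sequence[int],
    word_ids: Sequence[Optional[int]],
    sequence_ids: Sequence[Optional[int]],
) -> List[int]:
    """Expand word-level 0/1 labels of the second string to token level.

    word_ids and sequence_ids follow the layout returned by the
    word_ids() and sequence_ids() methods of a fast tokenizer's
    BatchEncoding: None for special tokens, otherwise the word index
    and the index of the input string (0 or 1).
    """
    labels = []
    for word_id, seq_id in zip(word_ids, sequence_ids):
        if seq_id != 1 or word_id is None:
            labels.append(IGNORE_INDEX)  # special token or part of the label string
        else:
            labels.append(word_labels[word_id])  # every word piece inherits its word's label
    return labels


# Toy example: String2 = "no recent antibiotics", with only the last word
# labelled as an entity. The last word is assumed to be split into two word pieces.
toy_word_labels = [0, 0, 1]
toy_sequence_ids = [None, 0, None, 1, 1, 1, 1, None]
toy_word_ids = [None, 0, None, 0, 1, 2, 2, None]

token_labels = align_labels(toy_word_labels, toy_word_ids, toy_sequence_ids)
print(token_labels)
```

The resulting list would be stored as the `labels` entry next to the tokenizer output in each item of the torch.utils.data.Dataset. The word-level labels must follow the same word segmentation the tokenizer uses, which is easiest to guarantee by passing pre-split words with `is_split_into_words=True`.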
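The following datasets and entities were used for training and can therefore be used as labels in the first segment (as the first string). Note that multiword strings have been merged.
In addition, you can use the model for zero-shot inference on other classes, and fine-tune it with a few examples of other classes.
The code used for training and testing the model is available at https://github.com/br-ai-ns-institute/Zero-ShotNER.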
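If you use this model, or are inspired by it, please cite the following paper in your work:
Košprdić M., Prodanović N., Ljajić A., Bašaragin B., Milošević N., 2023. A transformer-based method for zero and few-shot biomedical named entity recognition. arXiv preprint arXiv:2305.04928. https://arxiv.org/abs/2305.04928
Or in bibtex format: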
    @misc{kosprdic2023transformerbased,
      title={A transformer-based method for zero and few-shot biomedical named entity recognition},
      author={Miloš Košprdić and Nikola Prodanović and Adela Ljajić and Bojana Bašaragin and Nikola Milošević},
      year={2023},
      eprint={2305.04928},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }