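CXR-BERT is a chest X-ray (CXR) domain-specific language model that makes use of an improved vocabulary, a novel pretraining procedure, weight regularization, and text augmentations. The resulting model demonstrates improved performance on radiology natural language inference, radiology masked language model token prediction, and downstream vision-language processing tasks such as zero-shot phrase grounding and image classification.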
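First, we pretrain CXR-BERT-general from a randomly initialized BERT model via masked language modeling (MLM) on PubMed abstracts and publicly available clinical notes, including chest radiology reports. Because of this broader pretraining corpus, the general model can be adapted to research in clinical domains other than chest radiology through domain-specific fine-tuning.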
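The masking step behind MLM can be sketched as follows. This is a minimal illustration, not the CXR-BERT pretraining code: the 15% selection rate and the 80/10/10 split between `[MASK]`, a random token, and the unchanged token are assumptions taken from the original BERT recipe, and `mask_tokens` is a hypothetical helper that works on whitespace-split words rather than real subword tokens.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) using BERT-style masking.

    labels[i] holds the original token at selected positions and None
    elsewhere, so a training loss would only cover the selected positions.
    The rates used here are assumptions, not values confirmed for CXR-BERT.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= mask_prob:
            # Position not selected: keep the token, no prediction target.
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK_TOKEN)         # replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # replace with a random token
        else:
            corrupted.append(token)              # keep the original token
    return corrupted, labels


report = "no pleural effusion or pneumothorax is seen".split()
vocab = sorted(set(report))
# A high mask_prob is used only so that this short example shows some masking.
corrupted, labels = mask_tokens(report, vocab, mask_prob=0.5, seed=1)
print(corrupted)
print(labels)
```

During pretraining, the model would receive the corrupted sequence and be trained to recover the original token at every position where the label is not `None`.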
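CXR-BERT-specialized is continually pretrained from CXR-BERT-general to further specialize in the chest X-ray domain. In the final stage, CXR-BERT is trained in a multi-modal contrastive learning framework similar to CLIP, in which the latent representation of the [CLS] token is used to align text and image embeddings.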
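A CLIP-style alignment objective can be sketched as a symmetric cross-entropy over the text/image similarity matrix. The sketch below is an illustration under stated assumptions, not the training code used for CXR-BERT: random arrays stand in for projected [CLS] and image embeddings, the temperature of 0.07 is a hypothetical value, and NumPy is used only to keep the example dependency-light.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Average of text-to-image and image-to-text cross-entropy.

    Row i of text_emb and row i of image_emb are treated as a matching pair;
    every other row in the batch acts as a negative.
    """
    text_emb = l2_normalize(text_emb)
    image_emb = l2_normalize(image_emb)
    logits = text_emb @ image_emb.T / temperature  # shape: (batch, batch)
    idx = np.arange(logits.shape[0])
    text_to_image = -log_softmax(logits, axis=1)[idx, idx].mean()
    image_to_text = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (text_to_image + image_to_text)


rng = np.random.default_rng(0)
text_cls = rng.normal(size=(4, 8))                     # stand-in for [CLS] projections
aligned_images = text_cls + 0.01 * rng.normal(size=(4, 8))
shuffled_images = aligned_images[::-1]                 # every pair is mismatched

loss_aligned = symmetric_contrastive_loss(text_cls, aligned_images)
loss_shuffled = symmetric_contrastive_loss(text_cls, shuffled_images)
print(f"aligned pairs:  {loss_aligned:.4f}")
print(f"shuffled pairs: {loss_shuffled:.4f}")
```

The loss is low when each text embedding is closest to its own image embedding and high when the pairs are shuffled, which is the signal that pulls matching report and image representations together.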

| Model | Model identifier on HuggingFace | Vocabulary | Note |
| --- | --- | --- | --- |
| CXR-BERT-general | 1238321 | PubMed & MIMIC | Pretrained for biomedical literature and clinical domains |
| CXR-BERT-specialized (after multi-modal training) | 1239321 | PubMed & MIMIC | Pretrained for chest X-ray domain |

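CXR-BERT-specialized is trained jointly with a ResNet-50 image model in a multi-modal contrastive learning framework. Prior to multi-modal learning, the image model is pretrained with SimCLR on the same set of images in MIMIC-CXR. The corresponding model definition and loading functions can be accessed through our HI-ML-Multimodal GitHub repository. The joint image and text model, namely BioViL, can be used in phrase grounding applications, as shown in this Python notebook example. Additionally, see the MS-CXR benchmark for a more systematic evaluation of joint image and text models on phrase grounding tasks.
The corresponding paper has been accepted for presentation at the European Conference on Computer Vision (ECCV) 2022.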

    @misc{,
      doi = {10.48550/ARXIV.2204.09817},
      url = {},
      author = {Boecking, Benedikt and Usuyama, Naoto and Bannur, Shruthi and Castro, Daniel C. and Schwaighofer, Anton and Hyland, Stephanie and Wetscherek, Maria and Naumann, Tristan and Nori, Aditya and Alvarez-Valle, Javier and Poon, Hoifung and Oktay, Ozan},
      title = {Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing},
      publisher = {arXiv},
      year = {2022},
    }

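Out-of-scope use: Any deployed use case of the model, commercial or otherwise, is currently out of scope. Although we evaluated the models on a broad set of publicly available research benchmarks, the models and evaluations are not intended for deployed use cases. Please refer to the associated paper for more details.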

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Load the model and tokenizer
    url = "microsoft/BiomedVLP-CXR-BERT-specialized"
    tokenizer = AutoTokenizer.from_pretrained(url, trust_remote_code=True)
    model = AutoModel.from_pretrained(url, trust_remote_code=True)

    # Input text prompts (e.g., reference, synonym, contradiction)
    text_prompts = ["There is no pneumothorax or pleural effusion",
                    "No pleural effusion or pneumothorax is seen",
                    "The extent of the pleural effusion is constant."]

    # Tokenize and compute the sentence embeddings
    tokenizer_output = tokenizer.batch_encode_plus(batch_text_or_text_pairs=text_prompts,
                                                   add_special_tokens=True,
                                                   padding='longest',
                                                   return_tensors='pt')
    embeddings = model.get_projected_text_embeddings(input_ids=tokenizer_output.input_ids,
                                                     attention_mask=tokenizer_output.attention_mask)

    # Compute the cosine similarity of sentence embeddings obtained from input text prompts.
    # Entry (i, j) of the resulting matrix compares prompt i with prompt j.
    sim = torch.mm(embeddings, embeddings.t())

The following results highlight how CXR-BERT compares with other common models, including ClinicalBERT and PubMedBERT:

| Model | RadNLI accuracy (MedNLI transfer) | Mask prediction accuracy | Avg. # tokens after tokenization | Vocabulary size |
| --- | --- | --- | --- | --- |
| RadNLI baseline | 53.30 | - | - | - |
| ClinicalBERT | 47.67 | 39.84 | 78.98 (+38.15%) | 28,996 |
| PubMedBERT | 57.71 | 35.24 | 63.55 (+11.16%) | 28,895 |
| CXR-BERT (after Phase-III) | 60.46 | 77.72 | 58.07 (+1.59%) | 30,522 |
| CXR-BERT (after Phase-III + Joint Training) | 65.21 | 81.58 | 58.07 (+1.59%) | 30,522 |

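The percentages in the tokenization column can be read as relative increases over a shared reference length. This section does not state what that reference is, so the short calculation below only checks that the three reported rows are consistent with a single common value; `implied_reference` is a hypothetical helper introduced for this check.

```python
# (average tokens after tokenization, reported relative increase in percent)
rows = {
    "ClinicalBERT": (78.98, 38.15),
    "PubMedBERT": (63.55, 11.16),
    "CXR-BERT": (58.07, 1.59),
}


def implied_reference(avg_tokens, pct_increase):
    """Invert avg = reference * (1 + pct / 100) to recover the reference length."""
    return avg_tokens / (1.0 + pct_increase / 100.0)


references = {name: implied_reference(*values) for name, values in rows.items()}
for name, ref in references.items():
    print(f"{name}: implied reference length = {ref:.2f}")

spread = max(references.values()) - min(references.values())
print(f"spread across models = {spread:.3f}")
```

All three rows imply a reference of roughly 57.2, so the percentages are mutually consistent. Read this way, the CXR-BERT vocabulary adds under 2% to sequence length, compared with about 38% for ClinicalBERT.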

| Vision–Language Pretraining Method | Text Encoder | MS-CXR Phrase Grounding (Avg. CNR Score) |
| --- | --- | --- |
| Baseline | ClinicalBERT | 0.769 |
| Baseline | PubMedBERT | 0.773 |
| ConVIRT | ClinicalBERT | 0.818 |
| GLoRIA | ClinicalBERT | 0.930 |
| BioViL | CXR-BERT | 1.027 |
| BioViL-L | CXR-BERT | 1.142 |

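This section does not define the CNR score. Assuming it stands for a contrast-to-noise ratio between the similarity values inside and outside an annotated bounding box, $|\mu_{in} - \mu_{out}| / \sqrt{\sigma_{in}^2 + \sigma_{out}^2}$, a sketch of the computation could look like the following. The formula, the function name, and the synthetic heatmaps are all assumptions made for illustration; consult the paper for the exact evaluation protocol.

```python
import numpy as np


def contrast_to_noise_ratio(heatmap, box_mask):
    """|mean_inside - mean_outside| / sqrt(var_inside + var_outside).

    heatmap:  2D array of text-to-image-region similarity values.
    box_mask: boolean 2D array, True inside the annotated bounding box.
    Assumes both regions are non-empty and not perfectly constant.
    """
    inside = heatmap[box_mask]
    outside = heatmap[~box_mask]
    return abs(inside.mean() - outside.mean()) / np.sqrt(inside.var() + outside.var())


rng = np.random.default_rng(0)
box_mask = np.zeros((8, 8), dtype=bool)
box_mask[2:5, 3:6] = True                 # hypothetical annotated bounding box

noise = 0.1 * rng.normal(size=(8, 8))
focused = noise + box_mask * 1.0          # similarity concentrated inside the box
unfocused = noise                         # similarity unrelated to the box

cnr_focused = contrast_to_noise_ratio(focused, box_mask)
cnr_unfocused = contrast_to_noise_ratio(unfocused, box_mask)
print(f"focused: {cnr_focused:.2f}, unfocused: {cnr_unfocused:.2f}")
```

Under this reading, a higher score means the similarity map separates the annotated region from the rest of the image more cleanly, which matches the ordering of the methods in the table.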
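Additional details about performance can be found in the corresponding paper, Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing.
For additional details on model training and evaluation, see the corresponding paper, "Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing", ECCV'22.
For additional inference pipelines with CXR-BERT, see the HI-ML-Multimodal GitHub repository.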