Model:
FremyCompany/BioLORD-STAMB2-v1
This model was trained using BioLORD, a new pre-training strategy for producing meaningful representations of clinical sentences and biomedical concepts.
State-of-the-art methodologies operate by maximizing the similarity in representation of names referring to the same concept, and preventing collapse through contrastive learning. However, because biomedical names are not always self-explanatory, this sometimes results in non-semantic representations.
BioLORD overcomes this issue by grounding its concept representations using definitions, as well as short descriptions derived from a multi-relational knowledge graph of biomedical ontologies. Thanks to this grounding, our model produces more semantic concept representations that match more closely the hierarchical structure of ontologies. BioLORD establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).
This model is based on sentence-transformers/all-mpnet-base-v2 and was further finetuned on the BioLORD-Dataset.
This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search. The model has been finetuned for the biomedical domain. While it retains a good ability to handle general-purpose text, it will be most useful if you are working with medical documents such as EHR records or clinical notes. Both sentences and short phrases can be embedded in the same latent space.
This model accompanies the BioLORD: Learning Ontological Representations from Definitions paper, accepted in the Findings of EMNLP 2022. When you use this model, please cite the original paper as follows:
@inproceedings{remy-etal-2022-biolord,
    title = "{B}io{LORD}: Learning Ontological Representations from Definitions for Biomedical Concepts and their Textual Descriptions",
    author = "Remy, François and Demuynck, Kris and Demeester, Thomas",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.104",
    pages = "1454--1465",
    abstract = "This work introduces BioLORD, a new pre-training strategy for producing meaningful representations for clinical sentences and biomedical concepts. State-of-the-art methodologies operate by maximizing the similarity in representation of names referring to the same concept, and preventing collapse through contrastive learning. However, because biomedical names are not always self-explanatory, it sometimes results in non-semantic representations. BioLORD overcomes this issue by grounding its concept representations using definitions, as well as short descriptions derived from a multi-relational knowledge graph consisting of biomedical ontologies. Thanks to this grounding, our model produces more semantic concept representations that match more closely the hierarchical structure of ontologies. BioLORD establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).",
}
You might also want to take a look at our MWE 2023 paper:
Using this model is easy when you have sentence-transformers installed:
pip install -U sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer

sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

model = SentenceTransformer('FremyCompany/BioLORD-STAMB2-v1')
embeddings = model.encode(sentences)
print(embeddings)
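Because all phrases are mapped into the same dense vector space, semantic search and clustering reduce to simple cosine-similarity comparisons. The following snippet is a minimal sketch (not part of the original model card) that scores the three example phrases against each other using the util.cos_sim helper from sentence-transformers; given the ontological grounding of BioLORD, "Cat scratch disease" and "Bartonellosis" would be expected to end up closer to each other than to "Cat scratch injury".

from sentence_transformers import SentenceTransformer, util

sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

model = SentenceTransformer('FremyCompany/BioLORD-STAMB2-v1')
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities between the three phrases (3x3 matrix)
scores = util.cos_sim(embeddings, embeddings)
print(scores)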
Without sentence-transformers, you can use the model like this: first, you pass your input through the transformer model, then you have to apply the right pooling operation on top of the contextualized word embeddings.
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('FremyCompany/BioLORD-STAMB2-v1')
model = AutoModel.from_pretrained('FremyCompany/BioLORD-STAMB2-v1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)
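Since the embeddings are L2-normalized in the last step above, the cosine similarity between any two sentences is simply the dot product of their vectors. As a small follow-up sketch (assuming sentence_embeddings from the snippet above is still in scope), the full pairwise similarity matrix can be obtained with a single matrix product:

# Pairwise cosine similarities: dot products of unit-length vectors (3x3 matrix)
similarity_matrix = sentence_embeddings @ sentence_embeddings.T
print(similarity_matrix)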
My own contributions to this model are covered by the MIT license. However, because the data used to train this model originates from UMLS, you need to ensure you have a proper license for UMLS before using this model. UMLS is free of charge in most countries, but you might have to create an account and report on your usage of the data yearly to keep a valid license.