Model:

sentence-transformers/multi-qa-MiniLM-L6-dot-v1

Language: English

multi-qa-MiniLM-L6-dot-v1

This is a sentence-transformers model: it maps sentences and paragraphs to a 384-dimensional dense vector space and was designed for semantic search. It has been trained on 215 million (question, answer) pairs from diverse sources. For an introduction to semantic search, have a look at: SBERT.net - Semantic Search

Usage (Sentence-Transformers)

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer, util

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

#Load the model
model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-dot-v1')

#Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)

#Compute dot score between query and all document embeddings
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

#Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

#Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

#Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)

Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model like this: first, you pass your input through the transformer model, then you have to apply the correct pooling operation on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch

#CLS Pooling - Take output from first token
def cls_pooling(model_output):
    return model_output.last_hidden_state[:,0]

#Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = cls_pooling(model_output)

    return embeddings


# Sentences we want sentence embeddings for
query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-dot-v1")
model = AutoModel.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-dot-v1")

#Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)

#Compute dot score between query and all document embeddings
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()

#Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

#Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

#Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)

Technical Details

Some technical details on how this model is to be used:

| Setting | Value |
| --- | --- |
| Dimensions | 384 |
| Produces normalized embeddings | No |
| Pooling method | CLS pooling |
| Suitable score functions | dot-product (e.g. util.dot_score) |
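
A minimal sketch, reusing the sentence-transformers loading shown above, that illustrates these settings: the embeddings are 384-dimensional and not length-normalized, so dot-product (util.dot_score) is the suitable score function:

from sentence_transformers import SentenceTransformer, util
import numpy as np

model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-dot-v1')

# 384-dimensional embeddings that are, in general, NOT unit-length
emb = model.encode(["How many people live in London?"])
print(emb.shape)                    # (1, 384)
print(np.linalg.norm(emb, axis=1))  # typically != 1.0, i.e. not normalized

# Dot-product is therefore the intended score function for this model
doc_emb = model.encode(["Around 9 Million people live in London"])
print(util.dot_score(emb, doc_emb))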

Background

The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective. We use a contrastive learning objective: given a sentence from a pair, the model should predict which out of a set of randomly sampled other sentences was actually paired with it in our dataset.
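
To make the objective concrete, here is a hedged sketch of an in-batch contrastive loss with dot-product scores; the actual implementation lives in train_script.py, the function below is only an illustration:

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, answer_emb, scale=1.0):
    # query_emb, answer_emb: (batch_size, dim); row i of each forms a true pair.
    # Every other answer in the batch serves as a randomly sampled negative.
    scores = query_emb @ answer_emb.T * scale                     # dot-product similarities, (batch, batch)
    labels = torch.arange(scores.size(0), device=scores.device)   # correct answer for query i is column i
    return F.cross_entropy(scores, labels)

# Shapes only; real training uses encoded (question, answer) pairs
q, a = torch.randn(8, 384), torch.randn(8, 384)
print(in_batch_contrastive_loss(q, a))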

We developed this model during the Community week using JAX/Flax for NLP & CV, organized by Hugging Face, as part of the project Train the Best Sentence Embedding Model Ever with 1B Training Pairs. We benefited from input on efficient deep learning frameworks from members of Google's Flax, JAX, and Cloud teams, as well as the efficient hardware infrastructure of seven TPU v3-8s needed to run the project.

Intended Uses

Our model is intended to be used for semantic search: it encodes queries/questions and text paragraphs in a dense vector space to find relevant documents.

Note that there is a limit of 512 word pieces: text longer than that will be truncated. Further note that the model was only trained on input text up to 250 word pieces; it might not work well for longer text.
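
A quick way to see this limit is to count word pieces with the tokenizer; a small sketch, using the same truncation=True setting as the encode() helper above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-dot-v1")

text = "Around 9 Million people live in London. " * 100   # deliberately long input

# Number of word pieces before any truncation
print(len(tokenizer(text)["input_ids"]))

# With truncation=True, anything beyond 512 word pieces is silently cut off
print(len(tokenizer(text, truncation=True, max_length=512)["input_ids"]))   # at most 512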

Training Procedure

The full training script is accessible in this current repository: train_script.py.

Pre-training

We use the pretrained nreimers/MiniLM-L6-H384-uncased model. Please refer to its model card for more detailed information about the pre-training procedure.

Training

We use the concatenation of multiple datasets to fine-tune our model. In total we have about 215 million (question, answer) pairs. Each dataset was sampled with a weighted probability; the configuration is detailed in the data_config.json file.

The model was trained with CLS pooling, dot-product as the similarity function, and a scale of 1.
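
A toy sketch of weighted dataset sampling; the names and weights below are hypothetical placeholders, the real values are defined in data_config.json:

import random

# Hypothetical dataset weights; actual names and probabilities live in data_config.json
dataset_weights = {
    "wikianswers_duplicates": 5,
    "stackexchange_title_body": 3,
    "squad2": 1,
}

def sample_dataset(weights):
    # Pick the dataset to draw the next training batch from, proportional to its weight
    names = list(weights)
    return random.choices(names, weights=[weights[n] for n in names], k=1)[0]

print(sample_dataset(dataset_weights))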

| Dataset | Number of training tuples |
| --- | --- |
| Duplicate question pairs from WikiAnswers | 77,427,422 |
| Automatically generated (Question, Paragraph) pairs for each paragraph in Wikipedia | 64,371,441 |
| (Title, Body) pairs from all StackExchanges | 25,316,456 |
| (Title, Answer) pairs from all StackExchanges | 21,396,559 |
| Triplets (query, answer, hard_negative) for 500k queries from Bing search engine | 17,579,773 |
| (query, answer) pairs for 3M Google queries and Google featured snippet | 3,012,496 |
| (Question, Answer) pairs from Amazon product pages | 2,448,839 |
| (Title, Answer) pairs from Yahoo Answers | 1,198,260 |
| (Question, Answer) pairs from Yahoo Answers | 681,164 |
| (Title, Question) pairs from Yahoo Answers | 659,896 |
| (Question, Answer) pairs for 140k questions, each with Top5 Google snippets on that question | 582,261 |
| (Question, Answer) pairs from Reddit ELI5 (explainlikeimfive) | 325,475 |
| Duplicate question pairs (titles) from all StackExchanges | 304,525 |
| (Question, Duplicate_Question, Hard_Negative) triplets for Quora Questions Pairs dataset | 103,663 |
| (Question, Paragraph) pairs for 100k real Google queries with relevant Wikipedia paragraph | 100,231 |
| (Question, Paragraph) pairs from SQuAD2.0 dataset | 87,599 |
| (Question, Evidence) pairs | 73,346 |
| Total | 214,988,242 |