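Model:
sentence-transformers/multi-qa-MiniLM-L6-dot-v1
This model maps sentences and paragraphs to a 384-dimensional dense vector space and was designed for semantic search. It was trained on 215M (question, answer) pairs from diverse sources. For an introduction to semantic search, see: SBERT.net - Semantic Search
Using this model is straightforward once you have sentence-transformers installed: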
pip install -U sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer, util
query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]
#Load the model
model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-dot-v1')
#Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)
#Compute dot score between query and all document embeddings
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()
#Combine docs & scores
doc_score_pairs = list(zip(docs, scores))
#Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
#Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
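Without sentence-transformers, you can use the model as follows: first pass the input through the transformer model, then apply the correct pooling operation on top of the contextualized word embeddings.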
from transformers import AutoTokenizer, AutoModel
import torch
#CLS Pooling - Take output from first token
def cls_pooling(model_output):
    return model_output.last_hidden_state[:,0]
#Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)
    # Perform pooling
    embeddings = cls_pooling(model_output)
    return embeddings
# Sentences we want sentence embeddings for
query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-dot-v1")
model = AutoModel.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-dot-v1")
#Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)
#Compute dot score between query and all document embeddings
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()
#Combine docs & scores
doc_score_pairs = list(zip(docs, scores))
#Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
#Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
The following are some technical details on how this model should be used:
| Setting | Value |
|---|---|
| Dimensions | 384 |
| Produces normalized embeddings | No |
| Pooling-Method | CLS pooling |
| Suitable score functions | dot-product (e.g. util.dot_score) |
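Because the embeddings are not normalized, dot-product and cosine similarity can rank the same documents differently, which is why the score function matters here. The sketch below illustrates this with made-up 3-dimensional vectors (assumptions for illustration, not real model outputs); the helper names `dot` and `cosine` are hypothetical.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: the dot product of the length-normalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors chosen for illustration only; they are not model outputs.
query = [1.0, 0.0, 0.0]
docs = {
    "long_vector": [3.0, 3.0, 0.0],   # large norm, 45 degrees from the query
    "short_vector": [1.0, 0.1, 0.0],  # small norm, almost parallel to the query
}

dot_ranking = sorted(docs, key=lambda name: dot(query, docs[name]), reverse=True)
cos_ranking = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)

print("dot-product ranking:", dot_ranking)
print("cosine ranking:     ", cos_ranking)
```

The two rankings disagree on this toy input: the dot product rewards the vector with the larger norm, while cosine similarity only looks at direction.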
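The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective: given one sentence from a pair, the model should predict which sentence, out of a set of randomly sampled other sentences, was actually paired with it in our dataset.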
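One common way to express such an objective is a cross-entropy loss over in-batch negatives: each query is scored against every answer in the batch, and the loss is low when the true partner gets the highest score. The following is a minimal sketch under that assumption; the function name `in_batch_contrastive_loss` is hypothetical, the 2-dimensional vectors are toy values, and the actual training script may differ in its details.

```python
import math

def in_batch_contrastive_loss(query_embs, answer_embs, scale=1.0):
    """Mean cross-entropy where answer i is the positive for query i
    and every other answer in the batch acts as a negative."""
    losses = []
    for i, q in enumerate(query_embs):
        # Dot-product score of query i against every answer in the batch.
        scores = [scale * sum(x * y for x, y in zip(q, a)) for a in answer_embs]
        # Numerically stable log-sum-exp over the batch.
        m = max(scores)
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability assigned to the true pair.
        losses.append(log_sum_exp - scores[i])
    return sum(losses) / len(losses)

# Toy 2-d embeddings for illustration; real embeddings have 384 dimensions.
queries = [[2.0, 0.0], [0.0, 2.0]]
aligned = [[2.0, 0.0], [0.0, 2.0]]   # each answer matches its own query
swapped = [[0.0, 2.0], [2.0, 0.0]]   # each answer matches the other query

loss_aligned = in_batch_contrastive_loss(queries, aligned)
loss_swapped = in_batch_contrastive_loss(queries, swapped)
print(loss_aligned, loss_swapped)
```

The loss is small when each query scores highest against its own answer and large when the pairs are swapped, which is the behavior the objective rewards.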
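We developed this model during the Community week using JAX/Flax for NLP & CV, organized by Hugging Face, as part of the project Train the Best Sentence Embedding Model Ever with 1B Training Pairs. We benefited from efficient hardware infrastructure to run the project (7 TPU v3-8s), as well as guidance on efficient deep learning frameworks from members of Google's Flax, JAX, and Cloud teams.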
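Our model is intended to be used for semantic search: it encodes queries/questions and text passages in a dense vector space and finds the relevant documents.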
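When a corpus has many passages, it is usually enough to keep only the best few hits per query. The sketch below shows one way to do that over precomputed embeddings using the dot product; the helper `top_k` is hypothetical and the 3-dimensional vectors are stand-ins for real 384-dimensional embeddings.

```python
import heapq

def top_k(query_emb, doc_embs, k=2):
    """Return (score, doc_index) tuples for the k highest dot-product scores."""
    scored = (
        (sum(q * d for q, d in zip(query_emb, doc)), idx)
        for idx, doc in enumerate(doc_embs)
    )
    return heapq.nlargest(k, scored)

# Toy 3-d vectors standing in for precomputed passage embeddings.
corpus_embs = [
    [0.1, 0.9, 0.0],
    [0.8, 0.1, 0.1],
    [0.5, 0.5, 0.0],
]
query_emb = [1.0, 0.0, 0.0]

hits = top_k(query_emb, corpus_embs, k=2)
for score, idx in hits:
    print(idx, score)
```

The sentence-transformers library also provides `util.semantic_search`, which accepts a `score_function` argument; passing `util.dot_score` there should match the score function this model expects.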
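Note that there is a limit of 512 word pieces: text longer than that is truncated. Also note that the model was trained only on input text of up to 250 word pieces, so it might not work well on longer text.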
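One possible way to handle longer passages is to split them into overlapping windows that stay within the length the model was trained on, then embed each window separately. The sketch below assumes the text is already tokenized; the function `chunk_tokens`, the overlap of 50, and the placeholder tokens are illustrative assumptions, not part of the model or its libraries.

```python
def chunk_tokens(tokens, max_len=250, overlap=50):
    """Split a token list into windows of at most max_len tokens,
    with consecutive windows sharing `overlap` tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

# Stand-in tokens; a real pipeline would use the model's word-piece tokenizer.
tokens = [f"tok{i}" for i in range(600)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

In practice the tokenizer also adds special tokens to each input, so the usable budget per window is slightly smaller than the nominal limit.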
The full training script is available in this repository: train_script.py.
We use the pretrained nreimers/MiniLM-L6-H384-uncased model. Please refer to its model card for more detailed information about the pre-training procedure.
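Training: we fine-tune the model on the concatenation of multiple datasets, about 215M (question, answer) pairs in total. Each dataset is sampled with a weighted probability; the configuration is detailed in the data_config.json file.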
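Weighted sampling of this kind can be sketched as picking a source dataset for each batch with probability proportional to its weight. The dataset names and weights below are hypothetical placeholders, and the layout of data_config.json is not assumed here; the function `sample_dataset_names` is likewise only an illustration.

```python
import random

def sample_dataset_names(weights, num_batches, seed=0):
    """Pick a source dataset for each batch, proportionally to its weight."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=num_batches)

# Hypothetical weights for illustration; the real values live in data_config.json.
example_weights = {"dataset_a": 5, "dataset_b": 3, "dataset_c": 2}

picks = sample_dataset_names(example_weights, num_batches=10_000)
counts = {name: picks.count(name) for name in example_weights}
print(counts)
```

Sampling by weight rather than by raw size lets smaller datasets contribute more often than their share of the total number of pairs would suggest.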
The model was trained with CLS pooling, dot-product as the similarity function, and a scale of 1.
| Dataset | Number of training tuples |
|---|---|
| Duplicate question pairs from WikiAnswers | 77,427,422 |
| Automatically generated (Question, Paragraph) pairs for each paragraph in Wikipedia | 64,371,441 |
| (Title, Body) pairs from all StackExchanges | 25,316,456 |
| (Title, Answer) pairs from all StackExchanges | 21,396,559 |
| Triplets (query, answer, hard_negative) for 500k queries from Bing search engine | 17,579,773 |
| (query, answer) pairs for 3M Google queries and Google featured snippet | 3,012,496 |
| (Question, Answer) pairs from Amazon product pages | 2,448,839 |
| (Title, Answer) pairs from Yahoo Answers | 1,198,260 |
| (Question, Answer) pairs from Yahoo Answers | 681,164 |
| (Title, Question) pairs from Yahoo Answers | 659,896 |
| (Question, Answer) pairs for 140k questions, each with Top5 Google snippets on that question | 582,261 |
| (Question, Answer) pairs from Reddit ELI5 (explainlikeimfive) | 325,475 |
| Duplicate questions pairs (titles) | 304,525 |
| (Question, Duplicate_Question, Hard_Negative) triplets for Quora Questions Pairs dataset | 103,663 |
| (Question, Paragraph) pairs for 100k real Google queries with relevant Wikipedia paragraph | 100,231 |
| (Question, Paragraph) pairs from SQuAD2.0 dataset | 87,599 |
| (Question, Evidence) pairs | 73,346 |
| Total | 214,988,242 |