Model:
sentence-transformers/multi-qa-distilbert-dot-v1
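This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and was designed for semantic search. It was trained on 215M (question, answer) pairs from diverse sources. For an introduction to semantic search, see: SBERT.net - Semantic Search
Using this model is easy once you have sentence-transformers installed:

    pip install -U sentence-transformers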
Then you can use the model like this:

    from sentence_transformers import SentenceTransformer, util

    query = "How many people live in London?"
    docs = ["Around 9 Million people live in London", "London is known for its financial district"]

    # Load the model
    model = SentenceTransformer('sentence-transformers/multi-qa-distilbert-dot-v1')

    # Encode query and documents
    query_emb = model.encode(query)
    doc_emb = model.encode(docs)

    # Compute dot score between query and all document embeddings
    scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

    # Combine docs & scores
    doc_score_pairs = list(zip(docs, scores))

    # Sort by decreasing score
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

    # Output passages & scores
    for doc, score in doc_score_pairs:
        print(score, doc)
Without sentence-transformers, you can use the model like this: first, you pass your input through the transformer model, then you apply the correct pooling operation on top of the contextualized word embeddings.

    from transformers import AutoTokenizer, AutoModel
    import torch

    # CLS pooling: take the output of the first token
    def cls_pooling(model_output):
        return model_output.last_hidden_state[:, 0]

    # Encode text
    def encode(texts):
        # Tokenize sentences
        encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

        # Compute token embeddings
        with torch.no_grad():
            model_output = model(**encoded_input, return_dict=True)

        # Perform pooling
        embeddings = cls_pooling(model_output)

        return embeddings

    # Sentences we want sentence embeddings for
    query = "How many people live in London?"
    docs = ["Around 9 Million people live in London", "London is known for its financial district"]

    # Load model from HuggingFace Hub
    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-distilbert-dot-v1")
    model = AutoModel.from_pretrained("sentence-transformers/multi-qa-distilbert-dot-v1")

    # Encode query and docs
    query_emb = encode(query)
    doc_emb = encode(docs)

    # Compute dot score between query and all document embeddings
    scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()

    # Combine docs & scores
    doc_score_pairs = list(zip(docs, scores))

    # Sort by decreasing score
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

    # Output passages & scores
    for doc, score in doc_score_pairs:
        print(score, doc)
The following are some technical details on how this model should be used:
Setting | Value |
---|---|
Dimensions | 768 |
Produces normalized embeddings | No |
Pooling-Method | CLS pooling |
Suitable score functions | dot-product (e.g. util.dot_score ) |
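
Because the embeddings are not normalized, dot-product and cosine similarity can rank the same documents differently. The following is a minimal sketch of that effect in plain Python. The 3-dimensional vectors are made-up stand-ins for real 768-dimensional embeddings, and the helper names `dot` and `cosine` are illustrative, not part of any library.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: the dot product after length normalization."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up toy vectors (assumption: real embeddings have 768 dimensions).
query = [1.0, 0.0, 0.0]
doc_short = [1.0, 0.0, 0.0]   # same direction as the query, small norm
doc_long = [3.0, 3.0, 0.0]    # different direction, larger norm

docs = {"doc_short": doc_short, "doc_long": doc_long}
dot_scores = {name: dot(query, vec) for name, vec in docs.items()}
cos_scores = {name: cosine(query, vec) for name, vec in docs.items()}

print("dot:", dot_scores)
print("cosine:", cos_scores)
```

In this toy case the dot product favors the longer vector while cosine similarity favors the one pointing in the same direction as the query, which is why the table lists dot-product as the suitable score function for this model.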
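The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective: given one sentence from a pair, the model should predict which sentence, out of a set of randomly sampled other sentences, was actually paired with it in our dataset.
We developed this model during the Community week using JAX/Flax for NLP & CV, organized by Hugging Face, as part of the project Train the Best Sentence Embedding Model Ever with 1B Training Pairs. We benefited from efficient hardware infrastructure to run the project (7 TPU v3-8s), as well as guidance from members of Google's Flax, JAX, and Cloud teams on efficient deep learning frameworks.
Our model is intended for semantic search: it encodes queries/questions and text passages into a dense vector space, in which documents relevant to a given query or passage can be found.
Note that there is a limit of 512 word pieces: text longer than that is truncated. Also note that the model was only trained on input text of up to 250 word pieces, so it might not work well on longer text.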
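
One common way to handle longer inputs is to split them into overlapping windows that stay within the length the model was trained on, and to encode each window separately. The sketch below is an illustration rather than something this model card prescribes: the function `split_into_windows`, the `stride` value, and the placeholder tokens are all assumptions, and a real pipeline would count word pieces produced by the model's tokenizer.

```python
def split_into_windows(tokens, max_len=250, stride=200):
    """Split a token list into overlapping windows of at most max_len tokens.

    Consecutive windows start `stride` tokens apart, so they overlap by
    max_len - stride tokens (hypothetical defaults, not from the model card).
    """
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the text
        start += stride
    return windows

# Placeholder tokens standing in for real word pieces.
tokens = [f"tok{i}" for i in range(600)]
windows = split_into_windows(tokens)

for i, window in enumerate(windows):
    print(i, len(window), window[0], window[-1])
```

Each window can then be encoded on its own, and the per-window scores combined in whatever way suits the application, for example by taking the maximum.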
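The full training script is available in this repository: train_script.py.
We use the pretrained distilbert-base-uncased model. Please refer to its model card for more detailed information about the pre-training procedure.
Training
We use the concatenation of multiple datasets to fine-tune our model. In total we have about 215M (question, answer) pairs. We sampled each dataset according to a weighted probability; the configuration is detailed in the data_config.json file.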
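
Weighted sampling across datasets can be sketched as follows. The dataset names, the weights, and the function `sample_batch_sources` are hypothetical; the actual weights live in the configuration file mentioned above, whose format is not reproduced here.

```python
import random

# Hypothetical dataset names and sampling weights (not the real configuration).
dataset_weights = {"dataset_a": 5.0, "dataset_b": 3.0, "dataset_c": 1.0}

def sample_batch_sources(weights, num_batches, seed=0):
    """Pick, for each training batch, the dataset it is drawn from.

    Each dataset is chosen with probability proportional to its weight.
    """
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=num_batches)

sources = sample_batch_sources(dataset_weights, num_batches=1000)
counts = {name: sources.count(name) for name in dataset_weights}
print(counts)
```

With these made-up weights, roughly five ninths of the batches would come from `dataset_a`, so larger or more valuable datasets can be emphasized without being concatenated in proportion to their raw size.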
The model was trained with MultipleNegativesRankingLoss, using CLS pooling, dot-product as the similarity function, and a scale of 1.
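
The contrastive objective can be illustrated with a small, dependency-free sketch: each question is scored against every answer in the batch by dot product, and a softmax cross-entropy rewards the answer it was actually paired with. This is a simplified illustration of the idea, not the sentence-transformers implementation; the function name `in_batch_ranking_loss` and the 2-dimensional toy embeddings are assumptions.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def in_batch_ranking_loss(query_embs, answer_embs, scale=1.0):
    """Mean cross-entropy where answer i is the positive for query i.

    All other answers in the batch act as negatives. `scale` multiplies
    the dot-product scores before the softmax.
    """
    losses = []
    for i, q in enumerate(query_embs):
        logits = [scale * dot(q, a) for a in answer_embs]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Made-up 2-d embeddings for a batch of two (question, answer) pairs.
queries = [[2.0, 0.0], [0.0, 2.0]]
aligned_answers = [[2.0, 0.0], [0.0, 2.0]]   # each answer matches its question
swapped_answers = [[0.0, 2.0], [2.0, 0.0]]   # answers paired with the wrong question

loss_aligned = in_batch_ranking_loss(queries, aligned_answers)
loss_swapped = in_batch_ranking_loss(queries, swapped_answers)
print(loss_aligned, loss_swapped)
```

The loss is small when each question scores highest against its own answer and large when the pairing is wrong, which is the behavior the training objective pushes the encoder toward.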
Dataset | Number of training tuples |
---|---|
Duplicate question pairs from WikiAnswers | 77,427,422 |
Automatically generated (Question, Paragraph) pairs for each paragraph in Wikipedia | 64,371,441 |
(Title, Body) pairs from all StackExchanges | 25,316,456 |
(Title, Answer) pairs from all StackExchanges | 21,396,559 |
Triplets (query, answer, hard_negative) for 500k queries from Bing search engine | 17,579,773 |
(query, answer) pairs for 3M Google queries and Google featured snippet | 3,012,496 |
(Question, Answer) pairs from Amazon product pages | 2,448,839 |
(Title, Answer) pairs from Yahoo Answers | 1,198,260 |
(Question, Answer) pairs from Yahoo Answers | 681,164 |
(Title, Question) pairs from Yahoo Answers | 659,896 |
(Question, Answer) pairs for 140k questions, each with Top5 Google snippets on that question | 582,261 |
(Question, Answer) pairs from Reddit ELI5 (explainlikeimfive) | 325,475 |
Duplicate questions pairs (titles) | 304,525 |
(Question, Duplicate_Question, Hard_Negative) triplets for Quora Questions Pairs dataset | 103,663 |
(Question, Paragraph) pairs for 100k real Google queries with relevant Wikipedia paragraph | 100,231 |
(Question, Paragraph) pairs from SQuAD2.0 dataset | 87,599 |
(Question, Evidence) pairs | 73,346 |
Total | 214,988,242 |