Model: sentence-transformers/multi-qa-distilbert-cos-v1
Task: Sentence Similarity
Datasets: flax-sentence-embeddings/stackexchange_xml, ms_marco, gooaq, yahoo_answers_topics, search_qa, eli5, natural_questions, trivia_qa, embedding-data/QQP, embedding-data/PAQ_pairs, embedding-data/Amazon-QA, embedding-data/WikiAnswers

This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and was designed for semantic search. It was trained on 215M (question, answer) pairs from diverse sources. For an introduction to semantic search, have a look at: SBERT.net - Semantic Search
Using this model becomes easy when you have sentence-transformers installed:
```
pip install -U sentence-transformers
```
Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer, util

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load the model
model = SentenceTransformer('sentence-transformers/multi-qa-distilbert-cos-v1')

# Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)

# Compute dot score between query and all document embeddings
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
```
Without sentence-transformers, you can use the model like this: first, pass your input through the transformer model, then apply the correct pooling operation on top of the contextualized word embeddings.
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean Pooling - take the attention-mask-weighted average of all token embeddings
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    embeddings = F.normalize(embeddings, p=2, dim=1)

    return embeddings

# Sentences we want sentence embeddings for
query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-distilbert-cos-v1")
model = AutoModel.from_pretrained("sentence-transformers/multi-qa-distilbert-cos-v1")

# Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)

# Compute dot score between query and all document embeddings
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
```
The following are some technical details on how this model should be used:
Setting | Value
---|---
Dimensions | 768
Produces normalized embeddings | Yes
Pooling method | Mean pooling
Suitable score functions | dot-product (`util.dot_score`), cosine-similarity (`util.cos_sim`), or Euclidean distance
Note: when loaded with sentence-transformers, this model produces normalized embeddings of length 1. In that case, dot-product and cosine-similarity are equivalent; dot-product is preferred because it is faster. Euclidean distance is proportional to dot-product and can also be used.
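For intuition: for unit-length vectors u and v, cos(u, v) = u·v and ‖u − v‖² = 2 − 2·(u·v), so all three score functions produce the same ranking. A minimal sketch verifying this with the model loaded above:

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/multi-qa-distilbert-cos-v1')
u, v = model.encode(["How many people live in London?",
                     "Around 9 Million people live in London"])

# Embeddings are already L2-normalized, so dot product equals cosine similarity
print(util.dot_score(u, v))  # same value as ...
print(util.cos_sim(u, v))    # ... this

# ... and squared Euclidean distance is an affine function of the dot product
print(np.isclose(np.sum((u - v) ** 2), 2 - 2 * np.dot(u, v)))  # True
```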
This project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective. We used a contrastive learning objective: given a sentence from a pair, the model should predict which sentence, out of a set of randomly sampled others, was actually paired with it in our dataset.
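With in-batch negatives, that objective reduces to a cross-entropy over the batch's question-answer similarity matrix. A minimal PyTorch sketch of the idea (simplified; not the exact training code):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, answer_emb, scale=20.0):
    # query_emb, answer_emb: (batch, dim) L2-normalized embeddings.
    # Row i scores question i against every answer in the batch;
    # the true answer for question i sits on the diagonal.
    scores = query_emb @ answer_emb.T * scale
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
```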
We developed this model during the Community week using JAX/Flax for NLP & CV, organized by Hugging Face, as part of the project Train the Best Sentence Embedding Model Ever with 1B Training Pairs. We benefited from efficient hardware infrastructure to run the project: 7 TPU v3-8s, as well as guidance on efficient deep learning frameworks from Google's Flax, JAX, and Cloud team members.
Our model is intended to be used for semantic search: it encodes queries/questions and text paragraphs into a dense vector space and finds documents relevant to a given passage.
Note that there is a limit of 512 word pieces: text longer than that will be truncated. Further note that the model was only trained on input text of up to 250 word pieces; it may not work well for longer text.
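With sentence-transformers you can inspect and, if desired, lower that limit yourself. A small sketch (capping inputs at 250 word pieces, mirroring the training limit above, is an assumption you may want to tune):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/multi-qa-distilbert-cos-v1')
print(model.max_seq_length)  # current limit; longer inputs are silently truncated

# Optionally cap inputs at the length the model actually saw during training
model.max_seq_length = 250
```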
The full training script is available in this repository: train_script.py.
We used the pretrained distilbert-base-uncased model. Please refer to that model's card for more detailed information about the pre-training procedure.
Training

We used the concatenation of multiple datasets to fine-tune our model: in total, about 215M (question, answer) pairs. Each dataset was sampled with a weighted probability; the configuration is detailed in the data_config.json file.
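As an illustration of such weighted sampling (a sketch only; the dataset names and weights below are hypothetical stand-ins for the real values in data_config.json):

```python
import random

# Hypothetical (dataset name -> sampling weight) mapping; see data_config.json
# in the repository for the actual configuration.
weights = {
    "wikianswers": 77_427_422,
    "paq": 64_371_441,
    "stackexchange_title_body": 25_316_456,
}

def next_dataset():
    # Pick the dataset to draw the next training batch from,
    # with probability proportional to its weight
    names = list(weights)
    return random.choices(names, weights=[weights[n] for n in names], k=1)[0]
```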
The model was trained with MultipleNegativesRankingLoss, using mean pooling, cosine similarity as the similarity function, and a scale of 20.
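In sentence-transformers terms, that setup corresponds roughly to the following sketch (illustrative only; the full pipeline lives in train_script.py):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models, util

# Mean pooling on top of the pretrained DistilBERT, as described above
word_emb = models.Transformer('distilbert-base-uncased')
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), pooling_mode='mean')
model = SentenceTransformer(modules=[word_emb, pooling])

# Toy (question, answer) pairs standing in for the 215M-pair mixture
train_examples = [
    InputExample(texts=["How many people live in London?",
                        "Around 9 Million people live in London"]),
    InputExample(texts=["What is the capital of France?",
                        "Paris is the capital and largest city of France"]),
]
# MultipleNegativesRankingLoss treats the other answers in a batch as
# negatives, so larger batches give a stronger training signal
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20, similarity_fct=util.cos_sim)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```

The table below lists the individual datasets and their sizes.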
Dataset | Number of training tuples
---|---
WikiAnswers Duplicate question pairs from WikiAnswers | 77,427,422
PAQ Automatically generated (Question, Paragraph) pairs for each paragraph in Wikipedia | 64,371,441
Stack Exchange (Title, Body) pairs from all StackExchanges | 25,316,456
Stack Exchange (Title, Answer) pairs from all StackExchanges | 21,396,559
MS MARCO Triplets (query, answer, hard_negative) for 500k queries from the Bing search engine | 17,579,773
GOOAQ (query, answer) pairs for 3M Google queries and Google featured snippets | 3,012,496
Amazon-QA (Question, Answer) pairs from Amazon product pages | 2,448,839
Yahoo Answers (Title, Answer) pairs from Yahoo Answers | 1,198,260
Yahoo Answers (Question, Answer) pairs from Yahoo Answers | 681,164
Yahoo Answers (Title, Question) pairs from Yahoo Answers | 659,896
SearchQA (Question, Answer) pairs for 140k questions, each with the Top-5 Google snippets for that question | 582,261
ELI5 (Question, Answer) pairs from Reddit ELI5 (explainlikeimfive) | 325,475
Stack Exchange Duplicate question pairs (titles) | 304,525
Quora Question Triplets (Question, Duplicate_Question, Hard_Negative) triplets for the Quora Question Pairs dataset | 103,663
Natural Questions (NQ) (Question, Paragraph) pairs for 100k real Google queries with a relevant Wikipedia paragraph | 100,231
SQuAD2.0 (Question, Paragraph) pairs from the SQuAD2.0 dataset | 87,599
TriviaQA (Question, Evidence) pairs | 73,346
Total | 214,988,242