Model: sentence-transformers/multi-qa-mpnet-base-dot-v1

Task: Sentence Similarity

Datasets: flax-sentence-embeddings/stackexchange_xml, ms_marco, gooaq, yahoo_answers_topics, search_qa, eli5, natural_questions, trivia_qa, embedding-data/QQP, embedding-data/PAQ_pairs, embedding-data/Amazon-QA, embedding-data/WikiAnswers

This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and was designed for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources. For an introduction to semantic search, see: SBERT.net - Semantic Search
Using this model is straightforward once sentence-transformers is installed:
pip install -U sentence-transformers
Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer, util

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load the model
model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')

# Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)

# Compute dot score between query and all document embeddings
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
```
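Without sentence-transformers, you can use the model like this: first, pass your input through the transformer model, then apply the correct pooling operation on top of the contextualized word embeddings.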
```python
from transformers import AutoTokenizer, AutoModel
import torch


# CLS pooling: take the output of the first token
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]


# Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = cls_pooling(model_output)

    return embeddings


# Sentences we want sentence embeddings for
query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-mpnet-base-dot-v1")
model = AutoModel.from_pretrained("sentence-transformers/multi-qa-mpnet-base-dot-v1")

# Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)

# Compute dot score between query and all document embeddings
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
```
Here are some technical details on how this model should be used:
Setting | Value |
---|---|
Dimensions | 768 |
Produces normalized embeddings | No |
Pooling-Method | CLS pooling |
Suitable score functions | dot-product (e.g. util.dot_score) |
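The table states that the embeddings are not normalized and that dot-product is the suitable score function. As a rough illustration of why the choice matters, the sketch below compares dot-product and cosine rankings on made-up 2-dimensional vectors. The vectors and the helper names `dot` and `cosine` are assumptions for illustration only; they are not model outputs and not the library's `util.dot_score`.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: the dot product after length normalization."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for 768-dimensional embeddings (invented values).
query = [1.0, 0.0]
docs = {
    "doc_a": [3.0, 3.0],  # long vector, 45 degrees away from the query
    "doc_b": [1.0, 0.1],  # short vector, almost aligned with the query
}

# Rank document names by each score function, highest score first.
dot_ranking = sorted(docs, key=lambda name: dot(query, docs[name]), reverse=True)
cos_ranking = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)

print(dot_ranking)  # ['doc_a', 'doc_b']
print(cos_ranking)  # ['doc_b', 'doc_a']
```

With unnormalized vectors, vector length influences the dot-product score, so the two functions can disagree on the ranking. Since the table lists only dot-product as suitable, scoring with the dot product is the choice consistent with it.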
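The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective: given one sentence from a pair, the model should predict which sentence, out of a set of randomly sampled other sentences, was actually paired with it in our dataset.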
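The objective described above can be sketched as a cross-entropy loss over in-batch candidates: each question is scored against every answer in the batch, and the answer it was paired with should receive the highest score. This is a minimal pure-Python sketch under stated assumptions, not the project's training code. The function name `in_batch_contrastive_loss` and the toy vectors are invented for illustration, and the dot-product scoring with a default scale of 1 follows the settings this card reports for training.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def in_batch_contrastive_loss(questions, answers, scale=1.0):
    """Mean cross-entropy where answers[i] is the positive for questions[i]
    and all other answers in the batch act as negatives."""
    losses = []
    for i, question in enumerate(questions):
        # Score this question against every answer in the batch.
        logits = [scale * dot(question, answer) for answer in answers]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(v - top) for v in logits))
        # Negative log-probability assigned to the true partner.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (invented values): two questions and their answers.
questions = [[2.0, 0.0], [0.0, 2.0]]
aligned_answers = [[2.0, 0.0], [0.0, 2.0]]  # each answer matches its question
swapped_answers = [[0.0, 2.0], [2.0, 0.0]]  # answers paired with the wrong question

loss_aligned = in_batch_contrastive_loss(questions, aligned_answers)
loss_swapped = in_batch_contrastive_loss(questions, swapped_answers)
print(loss_aligned < loss_swapped)  # True
```

The loss is small when each question scores its own answer above the other answers in the batch and large when it does not. The sketch ignores explicit hard negatives, which some of the training sources provide as triplets.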
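We developed this model during the Community week using JAX/Flax for NLP & CV, organized by Hugging Face, as part of the project Train the Best Sentence Embedding Model Ever with 1B Training Pairs. We benefited from efficient hardware infrastructure to run the project (7 TPU v3-8s), as well as guidance from members of Google's Flax, JAX, and Cloud teams on efficient deep learning frameworks.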
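Our model is intended for semantic search: it encodes queries/questions and text paragraphs in a dense vector space and finds relevant documents for the given passages.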
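Note that there is a limit of 512 word pieces: text longer than that will be truncated. Note also that the model was only trained on input text of up to 250 word pieces, so it might not work well for longer text.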
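One common workaround for the length limits above is to split a long text into overlapping windows before encoding. This is an assumption on our part, not something the model authors describe. The helper `split_into_windows` and its `window` and `overlap` defaults are hypothetical, and the input is taken to be a list of token ids that a tokenizer has already produced.

```python
def split_into_windows(token_ids, window=250, overlap=50):
    """Split a list of token ids into overlapping windows of at most
    `window` tokens; consecutive windows share `overlap` tokens."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + window])
        # Stop once the current window reaches the end of the input.
        if start + window >= len(token_ids):
            break
    return windows


# Stand-in for the token ids of a 600-token document (invented data).
token_ids = list(range(600))
windows = split_into_windows(token_ids)
print([len(w) for w in windows])  # [250, 250, 200]
```

Each window can then be encoded separately and scored against the query. How the per-window scores are combined, for example by taking the maximum, is a design choice left open here. A tokenizer typically adds special tokens on top of these windows, so a somewhat smaller `window` may be needed to stay within 250 word pieces.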
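The full training script is available in this repository: train_script.py.
We use the pretrained mpnet-base model. Please refer to its model card for more detailed information about the pre-training procedure.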
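Training

We use the concatenation of multiple datasets to fine-tune our model. In total we have about 215M (question, answer) pairs. We sample each dataset according to a weighted probability; the configuration is detailed in the data_config.json file.
The model was trained with MultipleNegativesRankingLoss, using CLS pooling, dot-product as the similarity function, and a scale of 1.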
Dataset | Number of training tuples |
---|---|
Duplicate question pairs from WikiAnswers | 77,427,422 |
Automatically generated (Question, Paragraph) pairs for each paragraph in Wikipedia | 64,371,441 |
(Title, Body) pairs from all StackExchanges | 25,316,456 |
(Title, Answer) pairs from all StackExchanges | 21,396,559 |
Triplets (query, answer, hard_negative) for 500k queries from Bing search engine | 17,579,773 |
(query, answer) pairs for 3M Google queries and Google featured snippet | 3,012,496 |
(Question, Answer) pairs from Amazon product pages | 2,448,839 |
(Title, Answer) pairs from Yahoo Answers | 1,198,260 |
(Question, Answer) pairs from Yahoo Answers | 681,164 |
(Title, Question) pairs from Yahoo Answers | 659,896 |
(Question, Answer) pairs for 140k questions, each with Top5 Google snippets on that question | 582,261 |
(Question, Answer) pairs from Reddit ELI5 (explainlikeimfive) | 325,475 |
Duplicate question pairs (titles) from StackExchange | 304,525 |
(Question, Duplicate_Question, Hard_Negative) triplets for Quora Questions Pairs dataset | 103,663 |
(Question, Paragraph) pairs for 100k real Google queries with relevant Wikipedia paragraph | 100,231 |
(Question, Paragraph) pairs from SQuAD2.0 dataset | 87,599 |
(Question, Evidence) pairs | 73,346 |
Total | 214,988,242 |
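The weighted sampling across datasets mentioned in the training description can be sketched as follows. This is only an illustration under stated assumptions. The dataset names and weights below are invented, the function `sample_dataset_names` is hypothetical, and the actual weights and their format live in data_config.json, which is not reproduced here.

```python
import random

# Hypothetical sampling weights, invented for illustration only.
dataset_weights = {"dataset_a": 5, "dataset_b": 3, "dataset_c": 1}


def sample_dataset_names(weights, k, seed=0):
    """Draw `k` dataset names, each with probability proportional to its weight."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)


# Decide which dataset each of 9000 training examples is drawn from.
draws = sample_dataset_names(dataset_weights, k=9000)
counts = {name: draws.count(name) for name in dataset_weights}
print(counts)
```

Over many draws, each dataset contributes examples roughly in proportion to its weight. Weights like these let smaller sources be over- or under-represented relative to their raw size.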