This is a sentence-transformers model: it maps sentences and paragraphs to a 384-dimensional dense vector space and was designed for semantic search. It was trained on 215M (question, answer) pairs from diverse sources. For an introduction to semantic search, see: SBERT.net - Semantic Search
Using this model is easy once you have sentence-transformers installed:
pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer, util

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load the model
model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')

# Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)

# Compute dot score between query and all document embeddings
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)

Without sentence-transformers, you can use the model as follows: first pass your input through the transformer model, then apply the correct pooling operation on top of the contextualized word embeddings.
# PyTorch usage (HuggingFace Transformers)
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F


# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    embeddings = F.normalize(embeddings, p=2, dim=1)

    return embeddings


# Sentences we want sentence embeddings for
query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")
model = AutoModel.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

# Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)

# Compute dot score between query and all document embeddings
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
# TensorFlow usage (HuggingFace Transformers)
from transformers import AutoTokenizer, TFAutoModel
import tensorflow as tf


# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = tf.cast(tf.tile(tf.expand_dims(attention_mask, -1), [1, 1, token_embeddings.shape[-1]]), tf.float32)
    return tf.math.reduce_sum(token_embeddings * input_mask_expanded, 1) / tf.math.maximum(tf.math.reduce_sum(input_mask_expanded, 1), 1e-9)


# Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='tf')

    # Compute token embeddings
    model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    embeddings = tf.math.l2_normalize(embeddings, axis=1)

    return embeddings


# Sentences we want sentence embeddings for
query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")
model = TFAutoModel.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

# Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)

# Compute dot score between query and all document embeddings
scores = (query_emb @ tf.transpose(doc_emb))[0].numpy().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
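The pooling step in the snippets above can be easier to follow on toy numbers. The following is a minimal pure-Python sketch, not the library code: `mean_pool` is a hypothetical helper, and the token vectors are made-up values. It shows why the attention mask matters: padding positions are excluded from the average.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding input, mirroring the 1e-9 clamp used above.
    denom = max(count, 1e-9)
    return [s / denom for s in summed]


# Toy sequence (hypothetical values): two real tokens, then one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

Without the mask, the padding vector would pull the average far away from the two real tokens.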
| Setting | Value |
| --- | --- |
| Dimensions | 384 |
| Produces normalized embeddings | Yes |
| Pooling method | Mean pooling |
| Suitable score functions | dot-product ( util.dot_score ), cosine-similarity ( util.cos_sim ), or euclidean distance |
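Because the embeddings are normalized to unit length, the three score functions listed above rank documents identically: the dot product equals the cosine similarity, and the squared euclidean distance equals 2 - 2 * dot. The sketch below checks this on made-up 3-dimensional vectors rather than real model output; all helper names are illustrative.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cos_sim(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical stand-ins for a query embedding and three document embeddings.
query = l2_normalize([1.0, 2.0, 2.0])
docs = [l2_normalize(d) for d in ([2.0, 1.0, 2.0], [0.0, 3.0, 4.0], [-1.0, 0.0, 1.0])]

indices = range(len(docs))
# Higher is better for dot product and cosine similarity; lower is better for distance.
by_dot = sorted(indices, key=lambda i: dot(query, docs[i]), reverse=True)
by_cos = sorted(indices, key=lambda i: cos_sim(query, docs[i]), reverse=True)
by_dist = sorted(indices, key=lambda i: euclidean(query, docs[i]))

print(by_dot, by_cos, by_dist)  # [1, 0, 2] [1, 0, 2] [1, 0, 2]
```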
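We developed this model during the Community week using JAX/Flax for NLP & CV, organized by Hugging Face, as part of the project: Train the Best Sentence Embedding Model Ever with 1B Training Pairs. We benefited from efficient hardware infrastructure to run the project (7 TPU v3-8 devices) and from guidance by members of Google's Flax, JAX, and Cloud teams on efficient deep learning frameworks.
We used the pretrained nreimers/MiniLM-L6-H384-uncased model. For more details about the pre-training procedure, please refer to its model card.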
The model was trained with MultipleNegativesRankingLoss, using mean pooling, cosine similarity as the similarity function, and a scale of 20.
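To illustrate the idea behind this loss, here is a minimal pure-Python sketch under stated assumptions: within a batch, the answer paired with each question is treated as the positive, every other answer in the batch serves as a negative, and a cross-entropy loss is taken over the cosine similarities multiplied by the scale. The function below is illustrative only; the actual sentence-transformers implementation operates on batched tensors, and the vectors here are made up.

```python
import math


def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def multiple_negatives_ranking_loss(questions, answers, scale=20.0):
    """Mean cross-entropy where answers[i] is the positive for questions[i]
    and all other answers in the batch act as negatives."""
    losses = []
    for i, question in enumerate(questions):
        logits = [scale * cos_sim(question, answer) for answer in answers]
        # Numerically stable log-sum-exp over all candidate answers.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability assigned to the correct answer.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Hypothetical 2-dimensional embeddings for a batch of two pairs.
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each answer lies close to its own question
swapped = [aligned[1], aligned[0]]   # each answer lies close to the wrong question

loss_aligned = multiple_negatives_ranking_loss(questions, aligned)
loss_swapped = multiple_negatives_ranking_loss(questions, swapped)
print(loss_aligned < loss_swapped)  # True
```

The scale sharpens the softmax: with cosine similarities bounded in [-1, 1], multiplying by 20 lets the loss separate a well-matched pair from the in-batch negatives much more strongly than the raw similarities would.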
| Dataset | Number of training tuples |
| --- | --- |
| Duplicate question pairs from WikiAnswers | 77,427,422 |
| Automatically generated (Question, Paragraph) pairs for each paragraph in Wikipedia | 64,371,441 |
| (Title, Body) pairs from all StackExchanges | 25,316,456 |
| (Title, Answer) pairs from all StackExchanges | 21,396,559 |
| Triplets (query, answer, hard_negative) for 500k queries from Bing search engine | 17,579,773 |
| (query, answer) pairs for 3M Google queries and Google featured snippet | 3,012,496 |
| (Question, Answer) pairs from Amazon product pages | 2,448,839 |
| (Title, Answer) pairs from Yahoo Answers | 1,198,260 |
| (Question, Answer) pairs from Yahoo Answers | 681,164 |
| (Title, Question) pairs from Yahoo Answers | 659,896 |
| (Question, Answer) pairs for 140k questions, each with Top5 Google snippets on that question | 582,261 |
| (Question, Answer) pairs from Reddit ELI5 (explainlikeimfive) | 325,475 |
| Duplicate question pairs (titles) | 304,525 |
| (Question, Duplicate_Question, Hard_Negative) triplets for Quora Questions Pairs dataset | 103,663 |
| (Question, Paragraph) pairs for 100k real Google queries with relevant Wikipedia paragraph | 100,231 |
| (Question, Paragraph) pairs from SQuAD2.0 dataset | 87,599 |
| (Question, Evidence) pairs | 73,346 |
| Total | 214,988,242 |