Model: sentence-transformers/msmarco-bert-co-condensor
This is a port of the Luyu/co-condenser-marco-retriever model to sentence-transformers: it maps sentences and paragraphs to a 768-dimensional dense vector space and is optimized for the task of semantic search.
It is based on the paper: Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval
| Model | MS MARCO Dev (MRR@10) | TREC DL 2019 | TREC DL 2020 | FiQA (NDCG@10) | TREC COVID (NDCG@10) | TREC News (NDCG@10) | TREC Robust04 (NDCG@10) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| msmarco-distilbert-base-v3 | 33.01 | 67.84 | 66.04 | 29.5 | 67.12 | 38.2 | 39.2 |
| msmarco-bert-co-condensor | 35.51 | 68.16 | 69.13 | 26.04 | 66.89 | 28.54 | 30.71 |
| msmarco-distilbert-base-tas-b | 34.43 | 71.04 | 69.78 | 30.02 | 65.39 | 37.70 | 42.70 |
| msmarco-distilbert-dot-v5 | 37.25 | 70.14 | 71.08 | 28.61 | 71.96 | 37.88 | 38.29 |
| msmarco-bert-base-dot-v5 | 38.08 | 70.51 | 73.45 | 32.29 | 74.81 | 38.81 | 42.67 |
For more details on this comparison, see: SBERT.net - MSMARCO Models
In the paper, Gao and Callan claim an MS MARCO dev score of 38.2 (MRR@10). This was achieved by changing the benchmark: the original MS MARCO dataset provides only queries and text passages, from which you must retrieve the relevant passages for a given query.
In their code, they combine the passages with the document titles from the MS MARCO document task, i.e. they train and evaluate the model with additional information from a different benchmark. In the table above, the score of 35.51 (MRR@10) is measured on the MS MARCO Passages benchmark as proposed, without document titles.
They also trained their model with the document titles, which creates an information leakage: the document titles were reconstructed at a later stage by the MS MARCO organizers for the MS MARCO document benchmark, and it was not possible to reconstruct the titles for all passages. However, the distribution of titles between relevant and non-relevant passages is not equal: 71.9% of the relevant passages have a document title, while only 64.4% of the non-relevant passages have one. Hence, the model can learn that as soon as a document title is present, the probability that the passage is annotated as relevant is higher. It then makes its decision not based on the passage content, but on whether a title exists.
The information leakage and the change of the benchmark may explain the inflated scores reported in the paper.
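The size of this bias can be illustrated with a short Bayes'-rule calculation. The two conditional probabilities are the ones quoted above; the prior is a hypothetical value chosen purely for illustration:

```python
# Sketch: how much does "passage has a title" shift the relevance estimate?
p_title_given_rel = 0.719      # 71.9% of relevant passages have a title
p_title_given_nonrel = 0.644   # 64.4% of non-relevant passages have a title
prior_rel = 0.5                # hypothetical prior P(relevant), for illustration only

# Posterior P(relevant | has title) via Bayes' rule
p_title = p_title_given_rel * prior_rel + p_title_given_nonrel * (1 - prior_rel)
posterior_rel = p_title_given_rel * prior_rel / p_title

print(f"likelihood ratio: {p_title_given_rel / p_title_given_nonrel:.3f}")
print(f"P(relevant | has title): {posterior_rel:.3f}")
```

Even with a neutral prior, the mere presence of a title raises the estimated relevance probability, which is exactly the shortcut a model can exploit.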
Using this model becomes easy when you have sentence-transformers installed:

```
pip install -U sentence-transformers
```
Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer, util

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load the model
model = SentenceTransformer('sentence-transformers/msmarco-bert-co-condensor')

# Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)

# Compute dot score between query and all document embeddings
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
```
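`util.dot_score` computes the unnormalized dot product between the query embedding and every document embedding; since this model was trained for dot-product similarity, the raw scores are used directly for ranking. The underlying computation can be sketched in plain Python (with tiny made-up vectors standing in for the real 768-dimensional embeddings):

```python
# Minimal sketch of what util.dot_score does: one dot product per document.
# The vectors here are tiny made-up examples, not real embeddings.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

query_emb = [0.2, 0.8, 0.1]
doc_embs = [[0.1, 0.9, 0.0],   # points in a similar direction -> higher dot score
            [0.7, 0.1, 0.5]]   # different direction -> lower dot score

scores = [dot(query_emb, d) for d in doc_embs]
ranking = sorted(range(len(doc_embs)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```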
Without sentence-transformers, you can use the model like this: first, pass your input through the transformer model, then apply the correct pooling operation on top of the contextualized word embeddings.
```python
from transformers import AutoTokenizer, AutoModel
import torch

# CLS Pooling - Take output from first token
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

# Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = cls_pooling(model_output)

    return embeddings

# Sentences we want sentence embeddings for
query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-bert-co-condensor")
model = AutoModel.from_pretrained("sentence-transformers/msmarco-bert-co-condensor")

# Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)

# Compute dot score between query and all document embeddings
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
```
For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
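The Pooling configuration above (`pooling_mode_cls_token: True`) means that only the embedding of the first ([CLS]) token is kept, matching the `cls_pooling` function in the Transformers example. A minimal sketch, using a dummy NumPy array as a stand-in for `last_hidden_state`:

```python
import numpy as np

# Dummy stand-in for model_output.last_hidden_state:
# shape (batch_size, seq_len, hidden_dim) = (2, 4, 768)
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(2, 4, 768))

# CLS pooling: keep only the first token's embedding per sequence
cls_embeddings = last_hidden_state[:, 0]

# For contrast, mean pooling (NOT what this model uses) averages all tokens
mean_embeddings = last_hidden_state.mean(axis=1)

print(cls_embeddings.shape)  # one 768-dim vector per input
```

Either pooling turns a (batch, seq_len, 768) tensor into one 768-dimensional vector per input; this model commits to the [CLS] variant.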
Please refer to: Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval