sentence-transformers/msmarco-bert-co-condensor

This is a port of the Luyu/co-condenser-marco-retriever model to sentence-transformers: it maps sentences and paragraphs to a 768-dimensional dense vector space and is optimized for the task of semantic search.

It is based on the paper: Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval

Evaluation

| Model | MS MARCO Dev (MRR@10) | TREC DL 2019 | TREC DL 2020 | FiQA (NDCG@10) | TREC COVID (NDCG@10) | TREC News (NDCG@10) | TREC Robust04 (NDCG@10) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| msmarco-distilbert-base-v3 | 33.01 | 67.84 | 66.04 | 29.5 | 67.12 | 38.2 | 39.2 |
| msmarco-bert-co-condensor | 35.51 | 68.16 | 69.13 | 26.04 | 66.89 | 28.54 | 30.71 |
| msmarco-distilbert-base-tas-b | 34.43 | 71.04 | 69.78 | 30.02 | 65.39 | 37.70 | 42.70 |
| msmarco-distilbert-dot-v5 | 37.25 | 70.14 | 71.08 | 28.61 | 71.96 | 37.88 | 38.29 |
| msmarco-bert-base-dot-v5 | 38.08 | 70.51 | 73.45 | 32.29 | 74.81 | 38.81 | 42.67 |

For more details on this comparison, see: SBERT.net - MSMARCO Models

In the paper, Gao & Callan claim an MS MARCO-Dev score of 38.2 (MRR@10). This was achieved by changing the benchmark: the original MS MARCO dataset provides only queries and text passages, from which you must retrieve the relevant passages for a given query.

In their code, they combine the passages with the document titles from the MS MARCO document task, i.e. they train and evaluate the model with additional information from a different benchmark. In the table above, the score of 35.41 (MRR@10) was computed on the MS MARCO Passages benchmark as it was proposed, without document titles.

They also trained their model with these document titles, which leads to a leakage of information: the document titles were reconstructed at a later point by the MS MARCO organizers for the MS MARCO document benchmark, and it was not possible to reconstruct titles for all passages. However, titles are not equally distributed between relevant and non-relevant passages: 71.9% of the relevant passages have a document title, compared to only 64.4% of the non-relevant passages. The model can therefore learn that a passage is more likely to be relevant whenever a document title is present; it then bases its decision partly on whether a title exists rather than on the passage content alone.

The information leakage and the change of the benchmark may lead to inflated scores being reported in the paper.
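The leakage argument above can be made concrete with Bayes' rule. The sketch below uses the title-presence rates quoted in the text (71.9% of relevant vs. 64.4% of non-relevant passages); the uniform prior is a hypothetical value chosen purely for illustration:

```python
# Conditional probabilities quoted in the text
p_title_given_rel = 0.719     # relevant passages with a document title
p_title_given_nonrel = 0.644  # non-relevant passages with a document title
p_relevant = 0.5              # hypothetical uniform prior, for illustration only

# Bayes' rule: P(relevant | has_title)
p_title = p_title_given_rel * p_relevant + p_title_given_nonrel * (1 - p_relevant)
p_rel_given_title = p_title_given_rel * p_relevant / p_title

print(round(p_rel_given_title, 3))  # prints 0.528
```

Because the result is above the 0.5 prior, the mere presence of a title already shifts the model's relevance estimate, independent of the passage content.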

Usage (Sentence-Transformers)

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer, util

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

#Load the model
model = SentenceTransformer('sentence-transformers/msmarco-bert-co-condensor')

#Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)

#Compute dot score between query and all document embeddings
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

#Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

#Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

#Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
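The scoring and ranking steps above do not depend on the model itself, only on the embeddings it produces. A minimal numpy sketch with made-up, low-dimensional stand-in embeddings (3 dimensions instead of 768) shows the same dot-product ranking that util.dot_score performs:

```python
import numpy as np

# Stand-in embeddings for one query and two documents; the values are
# invented purely to illustrate the ranking logic.
query_emb = np.array([1.0, 0.0, 1.0])
doc_emb = np.array([[0.9, 0.1, 0.8],    # on-topic passage
                    [0.1, 0.9, 0.0]])   # off-topic passage

# Same operation as util.dot_score: one dot product per document
scores = doc_emb @ query_emb

# Sort document indices by decreasing score, as in the snippet above
order = np.argsort(-scores)
print(order.tolist())  # prints [0, 1]: the on-topic passage ranks first
```

Note that this model was trained for dot-product similarity; embeddings are not normalized, so dot score and cosine similarity can rank documents differently.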

Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model like this: first, you pass your input through the transformer model, then you have to apply the correct pooling operation on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch

#CLS Pooling - Take output from first token
def cls_pooling(model_output):
    return model_output.last_hidden_state[:,0]

#Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = cls_pooling(model_output)

    return embeddings


# Sentences we want sentence embeddings for
query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-bert-co-condensor")
model = AutoModel.from_pretrained("sentence-transformers/msmarco-bert-co-condensor")

#Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)

#Compute dot score between query and all document embeddings
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()

#Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

#Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

#Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
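CLS pooling, as implemented by cls_pooling() above, simply keeps the first token's vector of each sequence. A minimal sketch on a dummy hidden-state array (numpy instead of torch, with an invented hidden size of 3 rather than 768) makes the slicing explicit:

```python
import numpy as np

# Dummy "last_hidden_state": a batch of 2 sequences, 4 tokens each,
# hidden size 3. A real BERT output would have shape (batch, seq_len, 768);
# the values here are arbitrary.
last_hidden_state = np.arange(24, dtype=float).reshape(2, 4, 3)

# CLS pooling: take the first token ([CLS]) of every sequence
embeddings = last_hidden_state[:, 0]

print(embeddings.shape)  # prints (2, 3): one fixed-size vector per input text
```

This is why no attention-mask-weighted averaging is needed here, unlike mean pooling: only position 0 is read, and padding tokens never occupy that position.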

Evaluation Results

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

Citing & Authors

See: Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval