Model: cross-encoder/msmarco-MiniLM-L6-en-de-v1
This is a cross-lingual Cross-Encoder model for English-German passage re-ranking. It was trained on the MS MARCO Passage Ranking task.
The model can be used for Information Retrieval: see SBERT.net Retrieve & Re-rank.
The training code is available in this repository; see train_script.py.
With SentenceTransformers installed, you can use the model like this:
```python
from sentence_transformers import CrossEncoder

model = CrossEncoder('model_name', max_length=512)

query = 'How many people live in Berlin?'
docs = ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
        'New York City is famous for the Metropolitan Museum of Art.']
pairs = [(query, doc) for doc in docs]
scores = model.predict(pairs)
```
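The returned scores are relevance logits: higher means more relevant. A minimal, model-free sketch of how they can be used to rank the passages (the score values below are placeholders standing in for `model.predict(pairs)`, not real model output):

```python
# Rank passages by cross-encoder relevance score (higher = more relevant).
docs = ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
        'New York City is famous for the Metropolitan Museum of Art.']
# Placeholder scores standing in for model.predict(pairs)
scores = [9.2, -4.3]

# Pair each score with its passage and sort by score, best first
ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
for score, doc in ranked:
    print(f'{score:.2f}\t{doc}')
```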
With the transformers library, you can use the model like this:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained('model_name')
tokenizer = AutoTokenizer.from_pretrained('model_name')

features = tokenizer(['How many people live in Berlin?', 'How many people live in Berlin?'],
                     ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
                      'New York City is famous for the Metropolitan Museum of Art.'],
                     padding=True, truncation=True, return_tensors="pt")

model.eval()
with torch.no_grad():
    scores = model(**features).logits
    print(scores)
```
The model was evaluated on three datasets.

We also checked the performance of bi-encoders using the same evaluation protocol: documents retrieved by BM25 lexical search are re-ranked using the cosine similarity between the query and passage embeddings. Bi-encoders can also be used for end-to-end semantic search.
| Model-Name | TREC-DL19 EN-EN | TREC-DL19 DE-EN | GermanDPR DE-DE | Docs / Sec |
| --- | --- | --- | --- | --- |
| BM25 | 45.46 | - | 35.85 | - |
| *Cross-Encoder Re-Rankers* | | | | |
| 1237321 | 72.43 | 65.53 | 46.77 | 1600 |
| 1238321 | 72.94 | 66.07 | 49.91 | 900 |
| 1239321 (DE only) | - | - | 53.67 | 260 |
| 12310321 (DE only) | - | - | 53.59 | 260 |
| *Bi-Encoders (re-ranking)* | | | | |
| 12311321 | 63.38 | 58.28 | 37.88 | 940 |
| 12312321 | 65.51 | 58.69 | 38.32 | 940 |
| 12313321 (DE only) | - | - | 34.31 | 450 |
| 12314321 (DE only) | - | - | 42.55 | 450 |
Note: Docs / Sec gives the number of (query, document) pairs we can re-rank within one second on a V100 GPU.
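The bi-encoder re-ranking step described above (cosine similarity between query and passage embeddings) can be sketched as follows. The embeddings here are toy vectors for illustration; in practice they would come from a bi-encoder's encode() call over the query and the BM25-retrieved passages:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rerank(query_emb, doc_embs, docs):
    """Re-rank candidate docs by cosine similarity to the query embedding."""
    scored = [(cosine(query_emb, emb), doc) for emb, doc in zip(doc_embs, docs)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Toy embeddings standing in for bi-encoder output
query_emb = [1.0, 0.2, 0.0]
doc_embs = [[0.9, 0.1, 0.1],   # points in roughly the same direction as the query
            [0.0, 1.0, 0.5]]   # nearly orthogonal to the query
docs = ['relevant passage', 'off-topic passage']
print(rerank(query_emb, doc_embs, docs))
```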