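Model:
cross-encoder/msmarco-MiniLM-L12-en-de-v1
This is a cross-lingual Cross-Encoder model for English and German that can be used for passage re-ranking. It was trained on the MS Marco Passage Ranking task.
The model can be used for information retrieval: see SBERT.net Retrieve & Re-rank.
The training code is available in this repository; see train_script.py.
With SentenceTransformers installed, you can use the model like this: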
    from sentence_transformers import CrossEncoder

    # Replace 'model_name' with the model ID, e.g. 'cross-encoder/msmarco-MiniLM-L12-en-de-v1'
    model = CrossEncoder('model_name', max_length=512)

    query = 'How many people live in Berlin?'
    docs = [
        'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
        'New York City is famous for the Metropolitan Museum of Art.',
    ]

    # Score every (query, document) pair; a higher score means a more relevant document
    pairs = [(query, doc) for doc in docs]
    scores = model.predict(pairs)
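The snippet above stops at the raw scores. As a minimal sketch of the re-ranking step itself, the following sorts documents by score, assuming one score per (query, document) pair. The helper `rank_documents` and the example scores are hypothetical and chosen only for illustration; real values would come from `model.predict(pairs)`.

```python
def rank_documents(docs, scores):
    """Return (score, doc) pairs sorted from most to least relevant."""
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    # float() also accepts NumPy scalars, so predicted scores can be passed in directly.
    pairs = [(float(score), doc) for score, doc in zip(scores, docs)]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


# Hypothetical scores for illustration only, not actual model output.
example_docs = [
    "New York City is famous for the Metropolitan Museum of Art.",
    "Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
]
example_scores = [-4.2, 8.1]

ranked = rank_documents(example_docs, example_scores)
for score, doc in ranked:
    print(f"{score:6.2f}  {doc}")
```

Keeping the score next to each document makes it easy to apply a cutoff or to keep only the top-k results afterwards.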
With the transformers library, you can use the model like this:
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch

    # Replace 'model_name' with the model ID, e.g. 'cross-encoder/msmarco-MiniLM-L12-en-de-v1'
    model = AutoModelForSequenceClassification.from_pretrained('model_name')
    tokenizer = AutoTokenizer.from_pretrained('model_name')

    # The query is repeated so that each document is paired with it
    features = tokenizer(
        ['How many people live in Berlin?', 'How many people live in Berlin?'],
        ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
         'New York City is famous for the Metropolitan Museum of Art.'],
        padding=True, truncation=True, return_tensors="pt")

    # Inference only: switch to eval mode and disable gradient tracking
    model.eval()
    with torch.no_grad():
        scores = model(**features).logits
        print(scores)
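Performance was evaluated on three datasets: TREC-DL19 EN-EN, TREC-DL19 DE-EN, and GermanDPR DE-DE.
We also evaluated bi-encoders under the same evaluation setup: the documents retrieved by BM25 lexical search were re-ranked by the cosine similarity between the query embedding and the passage embeddings. Bi-encoders can also be used for end-to-end semantic search.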
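The bi-encoder re-ranking described above can be sketched in plain Python. This is an illustration under stated assumptions, not the evaluation code: the helpers `cosine_similarity` and `rerank_by_cosine`, the document IDs, and the three-dimensional toy embeddings are all hypothetical. In practice the embeddings would come from a bi-encoder model and the candidates from BM25.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)


def rerank_by_cosine(query_emb, doc_embs, doc_ids):
    """Sort candidate documents by similarity to the query, best first."""
    scored = [
        (cosine_similarity(query_emb, emb), doc_id)
        for emb, doc_id in zip(doc_embs, doc_ids)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy embeddings (hypothetical); real ones have hundreds of dimensions.
query_emb = [1.0, 0.0, 1.0]
doc_embs = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
doc_ids = ["doc_a", "doc_b", "doc_c"]

reranked = rerank_by_cosine(query_emb, doc_embs, doc_ids)
for score, doc_id in reranked:
    print(f"{doc_id}: {score:.3f}")
```

Unlike a cross-encoder, this approach never feeds the query and document through the model together, so document embeddings can be computed once and reused across queries.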
Model-Name | TREC-DL19 EN-EN | TREC-DL19 DE-EN | GermanDPR DE-DE | Docs / Sec |
---|---|---|---|---|
BM25 | 45.46 | - | 35.85 | - |
Cross-Encoder Re-Rankers | | | | |
1237321 | 72.43 | 65.53 | 46.77 | 1600 |
1238321 | 72.94 | 66.07 | 49.91 | 900 |
1239321 (DE only) | - | - | 53.67 | 260 |
12310321 (DE only) | - | - | 53.59 | 260 |
Bi-Encoders (re-ranking) | | | | |
12311321 | 63.38 | 58.28 | 37.88 | 940 |
12312321 | 65.51 | 58.69 | 38.32 | 940 |
12313321 (DE only) | - | - | 34.31 | 450 |
12314321 (DE only) | - | - | 42.55 | 450 |
Note: Docs / Sec gives the number of (query, document) pairs we can re-rank per second on a V100 GPU.
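To put the Docs / Sec column in perspective, the arithmetic below converts throughput into an approximate re-ranking time per query. It is a rough sketch: the candidate-list size of 1000 is a hypothetical value chosen for illustration, the helper `rerank_latency_seconds` is not part of any library, and the estimate ignores batching, tokenization, and data-transfer overhead.

```python
def rerank_latency_seconds(num_candidates, pairs_per_second):
    """Estimate the time needed to score all candidates for a single query."""
    if pairs_per_second <= 0:
        raise ValueError("pairs_per_second must be positive")
    return num_candidates / pairs_per_second


# Hypothetical candidate-list size; adjust to your retrieval setup.
CANDIDATES_PER_QUERY = 1000

# Throughput values taken from the Docs / Sec column of the table above.
for throughput in (1600, 900, 260):
    seconds = rerank_latency_seconds(CANDIDATES_PER_QUERY, throughput)
    print(f"{throughput:>5} pairs/sec -> {seconds:.2f} s per query")
```

Under these assumptions, a re-ranker running at 260 pairs per second needs several seconds per query, while one running at 1600 pairs per second stays well under one second.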