模型:
ai-forever/sbert_large_mt_nlu_ru
The model is described in this article Russian SuperGLUE metrics
For better quality, use mean token embeddings.
You can use the model directly from the model repository to compute sentence embeddings:
from transformers import AutoTokenizer, AutoModel import torch #Mean Pooling - Take attention mask into account for correct averaging def mean_pooling(model_output, attention_mask): token_embeddings = model_output[0] #First element of model_output contains all token embeddings input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float() sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1) sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9) return sum_embeddings / sum_mask #Sentences we want sentence embeddings for sentences = ['Привет! Как твои дела?', 'А правда, что 42 твое любимое число?'] #Load AutoModel from huggingface model repository tokenizer = AutoTokenizer.from_pretrained("sberbank-ai/sbert_large_mt_nlu_ru") model = AutoModel.from_pretrained("sberbank-ai/sbert_large_mt_nlu_ru") #Tokenize sentences encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=24, return_tensors='pt') #Compute token embeddings with torch.no_grad(): model_output = model(**encoded_input) #Perform pooling. In this case, mean pooling sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])