cointegrated/rubert-tiny2

Model:

cointegrated/rubert-tiny2

Task:

Feature extraction

Libraries:

PyTorch Safetensors Transformers

Language:

Other:

bert pretraining russian fill-mask embeddings masked-lm tiny sentence-similarity

License:

mit

This is an updated version of cointegrated/rubert-tiny: a small Russian BERT-based encoder with high-quality sentence embeddings. This post in Russian gives more details.

The differences from the previous version include:

a larger vocabulary: 83828 tokens instead of 29564;
longer supported sequences: 2048 tokens instead of 512;
sentence embeddings that approximate LaBSE more closely than before;
meaningful segment embeddings (tuned on the NLI task);
the model is focused only on Russian.

The model should be used as is to produce sentence embeddings (e.g. for KNN classification of short texts) or fine-tuned for a downstream task.

Sentence embeddings can be produced as follows:

# pip install transformers sentencepiece
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny2")
model = AutoModel.from_pretrained("cointegrated/rubert-tiny2")
# model.cuda()  # uncomment it if you have a GPU

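def embed_bert_cls(text, model, tokenizer):
    # Tokenize the input, padding and truncating it where needed
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    # Run the encoder without tracking gradients (inference only)
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # Use the hidden state of the first ([CLS]) token as the sentence embedding
    embeddings = model_output.last_hidden_state[:, 0, :]
    # L2-normalize so that dot products equal cosine similarities
    embeddings = torch.nn.functional.normalize(embeddings)
    # Return the embedding of the first (or only) input text as a NumPy array
    return embeddings[0].cpu().numpy()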

print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (312,)
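
The KNN classification of short texts mentioned above could be sketched roughly as follows. This is a minimal illustration and not part of the original model card: the helpers cosine_scores and knn_classify and the names train_texts and train_labels are hypothetical. It assumes the embeddings are already L2-normalized, as embed_bert_cls returns them, so a plain dot product serves as cosine similarity.

```python
from collections import Counter

def cosine_scores(query_vec, train_vecs):
    """Dot product of the query with every training vector.

    For L2-normalized vectors this equals cosine similarity.
    """
    return [
        sum(float(q) * float(t) for q, t in zip(query_vec, vec))
        for vec in train_vecs
    ]

def knn_classify(query_vec, train_vecs, train_labels, k=3):
    """Return the majority label among the k most similar training vectors."""
    scores = cosine_scores(query_vec, train_vecs)
    # Indices of the k training examples with the highest similarity
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    votes = Counter(train_labels[i] for i in top)
    return votes.most_common(1)[0][0]

# Hypothetical usage with the model loaded above (train_texts and
# train_labels are assumed to be lists of equal length):
# train_vecs = [embed_bert_cls(t, model, tokenizer) for t in train_texts]
# query_vec = embed_bert_cls('привет мир', model, tokenizer)
# print(knn_classify(query_vec, train_vecs, train_labels, k=3))
```

The pure-Python loop keeps the sketch free of extra dependencies; for larger training sets, stacking the embeddings into one matrix and using a single matrix product would likely be faster.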

Author:

David Dale

Dataset size:

227.8 MB