Model:
mrm8488/TinyBERT-spanish-uncased-finetuned-ner
This model was obtained by fine-tuning the Spanish TinyBERT model I created via distillation on the NER (Named Entity Recognition) downstream task. The model size is 55 MB.
I preprocessed the dataset and split it into training and validation sets (80/20).
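The 80/20 split mentioned above can be sketched as follows. This is a minimal illustration using only the standard library; the card does not describe the actual preprocessing pipeline, so the function and placeholder data here are hypothetical:

```python
import random

def train_dev_split(examples, dev_fraction=0.2, seed=42):
    """Shuffle the examples and split them into train/dev sets (80/20 by default)."""
    rng = random.Random(seed)
    shuffled = examples[:]  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - dev_fraction))
    return shuffled[:cut], shuffled[cut:]

# Toy usage with placeholder sentences:
data = [f"sentence-{i}" for i in range(10)]
train, dev = train_dev_split(data)
print(len(train), len(dev))  # 8 2
```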
Dataset | # Examples |
---|---|
Train | 8.7 K |
Dev | 2.2 K |
Labels covered: `B-LOC`, `B-MISC`, `B-ORG`, `B-PER`, `I-LOC`, `I-MISC`, `I-ORG`, `I-PER`, `O`
Metric | Score |
---|---|
F1 | 70.00 |
Precision | 67.83 |
Recall | 71.46 |
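As a sanity check, F1 is the harmonic mean of precision and recall. Computing it from the reported precision and recall gives a value close to, but not exactly, the reported 70.00; small differences like this typically come from rounding or from the averaging choices of the evaluation script:

```python
precision = 67.83
recall = 71.46

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))
```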
Model | F1 score | Size (MB) |
---|---|---|
bert-base-spanish-wwm-cased (BETO) | 88.43 | 421 |
bert-spanish-cased-finetuned-ner | 90.17 | 420 |
Best Multilingual BERT | 87.38 | 681 |
TinyBERT-spanish-uncased-finetuned-ner (this one) | 70.00 | 55 |
Usage example:
```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

id2label = {
    "0": "B-LOC",
    "1": "B-MISC",
    "2": "B-ORG",
    "3": "B-PER",
    "4": "I-LOC",
    "5": "I-MISC",
    "6": "I-ORG",
    "7": "I-PER",
    "8": "O"
}

tokenizer = AutoTokenizer.from_pretrained('mrm8488/TinyBERT-spanish-uncased-finetuned-ner')
model = AutoModelForTokenClassification.from_pretrained('mrm8488/TinyBERT-spanish-uncased-finetuned-ner')

text = "Mis amigos están pensando viajar a Londres este verano."
input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)

outputs = model(input_ids)
last_hidden_states = outputs[0]

# Note: mapping position index -> word assumes every word in this sentence
# is a single subword token (which holds for this example).
for m in last_hidden_states:
    for index, n in enumerate(m):
        if index > 0 and index <= len(text.split(" ")):
            print(text.split(" ")[index - 1] + ": " + id2label[str(torch.argmax(n).item())])

'''
Output:
--------
Mis: O
amigos: O
están: O
pensando: O
viajar: O
a: O
Londres: B-LOC
este: O
verano.: O
'''
```
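The labels follow the BIO scheme, so token-level tags like the output above are usually grouped into entity spans for downstream use. A small helper sketch (hypothetical, not part of this card) shows the idea:

```python
def bio_to_entities(tokens, tags):
    """Group BIO tags into (entity_type, text) spans.

    A `B-X` tag opens a new entity of type X; consecutive `I-X` tags
    extend it; `O` (or a type change) closes any open entity.
    """
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(etype, " ".join(words)) for etype, words in entities]

# Applied to the example output above:
tokens = "Mis amigos están pensando viajar a Londres este verano.".split(" ")
tags = ["O", "O", "O", "O", "O", "O", "B-LOC", "O", "O"]
print(bio_to_entities(tokens, tags))  # [('LOC', 'Londres')]
```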
Made with ♥ in Spain