英文

基于SlovakBERT的命名实体识别

该模型是在Slovak wikiann数据集上对 gerulata/slovakbert 进行微调的版本。在评估集上取得以下结果:

  • 损失:0.1600
  • 精确度:0.9327
  • 召回率:0.9470
  • F1值:0.9398
  • 准确度:0.9785

用途和限制

支持的类别:LOCATION、PERSON、ORGANIZATION

from transformers import pipeline


ner_pipeline = pipeline(task='ner', model='crabz/slovakbert-ner')
input_sentence = "Minister financií a líder mandátovo najsilnejšieho hnutia OĽaNO Igor Matovič upozorňuje, že následky tretej vlny budú na Slovensku veľmi veľké."
classifications = ner_pipeline(input_sentence)

使用displaCy的

import spacy
from spacy import displacy


ner_map = {0: '0', 1: 'B-OSOBA', 2: 'I-OSOBA', 3: 'B-ORGANIZÁCIA', 4: 'I-ORGANIZÁCIA', 5: 'B-LOKALITA', 6: 'I-LOKALITA'}

entities = []
for i in range(len(classifications)):
    if classifications[i]['entity'] != 0:
        if ner_map[classifications[i]['entity']][0] == 'B':
            j = i + 1
            while j < len(classifications) and ner_map[classifications[j]['entity']][0] == 'I':
                j += 1
            entities.append((ner_map[classifications[i]['entity']].split('-')[1], classifications[i]['start'],
                             classifications[j - 1]['end']))

nlp = spacy.blank("en")  # it should work with any language

doc = nlp(input_sentence)

ents = []
for ee in entities:
    ents.append(doc.char_span(ee[1], ee[2], ee[0]))

doc.ents = ents

options = {"ents": ["OSOBA", "ORGANIZÁCIA", "LOKALITA"],
           "colors": {"OSOBA": "lightblue", "ORGANIZÁCIA": "lightcoral", "LOKALITA": "lightgreen"}}
displacy_html = displacy.render(doc, style="ent", options=options)
如下:

import spacy
from spacy import displacy


ner_map = {0: '0', 1: 'B-OSOBA', 2: 'I-OSOBA', 3: 'B-ORGANIZÁCIA', 4: 'I-ORGANIZÁCIA', 5: 'B-LOKALITA', 6: 'I-LOKALITA'}

entities = []
for i in range(len(classifications)):
    if classifications[i]['entity'] != 0:
        if ner_map[classifications[i]['entity']][0] == 'B':
            j = i + 1
            while j < len(classifications) and ner_map[classifications[j]['entity']][0] == 'I':
                j += 1
            entities.append((ner_map[classifications[i]['entity']].split('-')[1], classifications[i]['start'],
                             classifications[j - 1]['end']))

nlp = spacy.blank("en")  # it should work with any language

doc = nlp(input_sentence)

ents = []
for ee in entities:
    ents.append(doc.char_span(ee[1], ee[2], ee[0]))

doc.ents = ents

options = {"ents": ["OSOBA", "ORGANIZÁCIA", "LOKALITA"],
           "colors": {"OSOBA": "lightblue", "ORGANIZÁCIA": "lightcoral", "LOKALITA": "lightgreen"}}
displacy_html = displacy.render(doc, style="ent", options=options)
Minister financií a líder mandátovo najsilnejšieho hnutia OĽaNO ORGANIZÁCIA Igor Matovič OSOBA upozorňuje, že následky tretej vlny budú na Slovensku LOKALITA veľmi veľké.

训练过程

训练超参数

训练过程中使用了以下超参数:

  • 学习率:5e-05
  • 训练批次大小:32
  • 评估批次大小:8
  • 种子:42
  • 优化器:带有betas=(0.9,0.999)和epsilon=1e-08的Adam
  • 学习率调度器类型:linear
  • 训练轮数:15.0

训练结果

Training Loss Epoch Step Validation Loss Precision Recall F1 Accuracy
0.2342 1.0 625 0.1233 0.8891 0.9076 0.8982 0.9667
0.1114 2.0 1250 0.1079 0.9118 0.9269 0.9193 0.9725
0.0817 3.0 1875 0.1093 0.9173 0.9315 0.9243 0.9747
0.0438 4.0 2500 0.1076 0.9188 0.9353 0.9270 0.9743
0.028 5.0 3125 0.1230 0.9143 0.9387 0.9264 0.9744
0.0256 6.0 3750 0.1204 0.9246 0.9423 0.9334 0.9765
0.018 7.0 4375 0.1332 0.9292 0.9416 0.9353 0.9770
0.0107 8.0 5000 0.1339 0.9280 0.9427 0.9353 0.9769
0.0079 9.0 5625 0.1368 0.9326 0.9442 0.9383 0.9785
0.0065 10.0 6250 0.1490 0.9284 0.9445 0.9364 0.9772
0.0061 11.0 6875 0.1566 0.9328 0.9433 0.9380 0.9778
0.0031 12.0 7500 0.1555 0.9339 0.9473 0.9406 0.9787
0.0024 13.0 8125 0.1548 0.9349 0.9462 0.9405 0.9787
0.0015 14.0 8750 0.1562 0.9330 0.9469 0.9399 0.9788
0.0013 15.0 9375 0.1600 0.9327 0.9470 0.9398 0.9785

框架版本

  • Transformers 4.13.0.dev0
  • Pytorch 1.10.0+cu113
  • Datasets 1.15.1
  • Tokenizers 0.10.3