Named Entity Recognition based on SlovakBERT

This model is a fine-tuned version of gerulata/slovakbert on the Slovak wikiann dataset. It achieves the following results on the evaluation set:

Loss: 0.1600
Precision: 0.9327
Recall: 0.9470
F1: 0.9398
Accuracy: 0.9785

Intended uses & limitations

Supported classes: LOCATION, PERSON, ORGANIZATION

from transformers import pipeline


ner_pipeline = pipeline(task='ner', model='crabz/slovakbert-ner')
input_sentence = "Minister financií a líder mandátovo najsilnejšieho hnutia OĽaNO Igor Matovič upozorňuje, že následky tretej vlny budú na Slovensku veľmi veľké."
classifications = ner_pipeline(input_sentence)

with displaCy :

import spacy
from spacy import displacy


ner_map = {0: '0', 1: 'B-OSOBA', 2: 'I-OSOBA', 3: 'B-ORGANIZÁCIA', 4: 'I-ORGANIZÁCIA', 5: 'B-LOKALITA', 6: 'I-LOKALITA'}

entities = []
for i in range(len(classifications)):
    if classifications[i]['entity'] != 0:
        if ner_map[classifications[i]['entity']][0] == 'B':
            j = i + 1
            while j < len(classifications) and ner_map[classifications[j]['entity']][0] == 'I':
                j += 1
            entities.append((ner_map[classifications[i]['entity']].split('-')[1], classifications[i]['start'],
                             classifications[j - 1]['end']))

nlp = spacy.blank("en")  # it should work with any language

doc = nlp(input_sentence)

ents = []
for ee in entities:
    ents.append(doc.char_span(ee[1], ee[2], ee[0]))

doc.ents = ents

options = {"ents": ["OSOBA", "ORGANIZÁCIA", "LOKALITA"],
           "colors": {"OSOBA": "lightblue", "ORGANIZÁCIA": "lightcoral", "LOKALITA": "lightgreen"}}
displacy_html = displacy.render(doc, style="ent", options=options)

Minister financií a líder mandátovo najsilnejšieho hnutia OĽaNO ORGANIZÁCIA Igor Matovič OSOBA upozorňuje, že následky tretej vlny budú na Slovensku LOKALITA veľmi veľké.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 32
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 15.0

Training results

Training Loss	Epoch	Step	Validation Loss	Precision	Recall	F1	Accuracy
0.2342	1.0	625	0.1233	0.8891	0.9076	0.8982	0.9667
0.1114	2.0	1250	0.1079	0.9118	0.9269	0.9193	0.9725
0.0817	3.0	1875	0.1093	0.9173	0.9315	0.9243	0.9747
0.0438	4.0	2500	0.1076	0.9188	0.9353	0.9270	0.9743
0.028	5.0	3125	0.1230	0.9143	0.9387	0.9264	0.9744
0.0256	6.0	3750	0.1204	0.9246	0.9423	0.9334	0.9765
0.018	7.0	4375	0.1332	0.9292	0.9416	0.9353	0.9770
0.0107	8.0	5000	0.1339	0.9280	0.9427	0.9353	0.9769
0.0079	9.0	5625	0.1368	0.9326	0.9442	0.9383	0.9785
0.0065	10.0	6250	0.1490	0.9284	0.9445	0.9364	0.9772
0.0061	11.0	6875	0.1566	0.9328	0.9433	0.9380	0.9778
0.0031	12.0	7500	0.1555	0.9339	0.9473	0.9406	0.9787
0.0024	13.0	8125	0.1548	0.9349	0.9462	0.9405	0.9787
0.0015	14.0	8750	0.1562	0.9330	0.9469	0.9399	0.9788
0.0013	15.0	9375	0.1600	0.9327	0.9470	0.9398	0.9785

Framework versions

Transformers 4.13.0.dev0
Pytorch 1.10.0+cu113
Datasets 1.15.1
Tokenizers 0.10.3

作者:

Ivan Agarsky

数据集大小:

476.12 MB