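ner-bert-large-portuguese-cased-lenerbr is a NER (token classification) model for Portuguese in the legal domain. It was fine-tuned on 20 December 2021 in Google Colab from the model pierreguillou/bert-large-cased-pt-lenerbr on the LeNER_br training set. Because the fine-tuning dataset is small, the model overfit before the end of training. The overall final metrics on the validation set are F1 0.9082, precision 0.8975, recall 0.9191, and accuracy 0.9808 (see the paragraph "Validation metrics by Named Entity" for detailed metrics).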
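See also the base version of this model, which reaches an F1 of 0.893.
Note: the model pierreguillou/bert-large-cased-pt-lenerbr is a language model obtained by fine-tuning BERTimbau large with the MASK objective on the dataset LeNER-Br language modeling. The language model is specialized in this way first, before fine-tuning on the NER task, in order to obtain a better NER model.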
Blog post: NLP | Modelos e Web App para Reconhecimento de Entidade Nomeada (NER) no domínio jurídico brasileiro (29 December 2021)
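You can test this model in the widget on this page.
You can also use the NER App, which compares the two BERT models (base and large) fitted to the NER task with the Portuguese LeNER-Br dataset.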
```python
# install pytorch: check https://pytorch.org/
# !pip install transformers
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# parameters
model_name = "pierreguillou/ner-bert-large-cased-pt-lenerbr"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "Acrescento que não há de se falar em violação do artigo 114, § 3º, da Constituição Federal, posto que referido dispositivo revela-se impertinente, tratando da possibilidade de ajuizamento de dissídio coletivo pelo Ministério Público do Trabalho nos casos de greve em atividade essencial."

# tokenization
inputs = tokenizer(input_text, max_length=512, truncation=True, return_tensors="pt")
tokens = inputs.tokens()

# get predictions
outputs = model(**inputs).logits
predictions = torch.argmax(outputs, dim=2)

# print predictions
for token, prediction in zip(tokens, predictions[0].numpy()):
    print((token, model.config.id2label[prediction]))
```
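The loop above prints one label per subword token. To read the result as entities, the tokens usually need to be merged back into spans. The following is a minimal sketch of that post-processing step, not part of the original model card. It assumes BIO-style labels (`B-X`, `I-X`, `O`) and WordPiece continuation pieces prefixed with `##`; the function name `group_entities` and the sample pairs are hypothetical, hand-written illustrations rather than actual model output.

```python
from typing import List, Optional, Tuple


def group_entities(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge (token, label) pairs into (entity_text, entity_type) spans.

    Assumptions: labels follow the BIO scheme and WordPiece continuation
    pieces start with "##". An "I-" tag that does not continue an open
    entity of the same type is dropped.
    """
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the entity being built, if any.
        nonlocal words, current_type
        if words and current_type is not None:
            entities.append((" ".join(words), current_type))
        words, current_type = [], None

    for token, label in pairs:
        if token in ("[CLS]", "[SEP]", "[PAD]"):
            continue  # special tokens carry no text
        if token.startswith("##"):
            # Glue a continuation piece onto the previous word of the open entity.
            if words:
                words[-1] += token[2:]
            continue
        if label.startswith("B-"):
            flush()
            words, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            words.append(token)
        else:
            flush()
    flush()
    return entities


# Hand-written illustrative input (not real model output).
example_pairs = [
    ("[CLS]", "O"),
    ("artigo", "B-LEGISLACAO"),
    ("114", "I-LEGISLACAO"),
    ("da", "O"),
    ("Const", "B-LEGISLACAO"),
    ("##ituição", "I-LEGISLACAO"),
    ("Federal", "I-LEGISLACAO"),
    ("[SEP]", "O"),
]
example_entities = group_entities(example_pairs)
print(example_entities)
```

With real predictions, the input list could be built from the `(token, label)` pairs printed by the loop above.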
You can also use the pipeline. However, it appears to have an issue with the max_length of the input sequence.
```python
# !pip install transformers
from transformers import pipeline

model_name = "pierreguillou/ner-bert-large-cased-pt-lenerbr"

ner = pipeline(
    "ner",
    model=model_name
)

# input_text is the string defined in the previous example
ner(input_text)
```
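One possible workaround for long inputs is to split the text into shorter overlapping windows and run the pipeline on each window separately. The sketch below is an assumption about how this could be done, not a fix from the original model card. The helper `split_into_chunks` and its defaults (`max_words=200`, `overlap=20`) are hypothetical, and a word count is only a rough proxy for the number of subword tokens.

```python
from typing import List


def split_into_chunks(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into overlapping windows of whitespace-separated words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


# Small demonstration with tiny windows so the overlap is visible.
demo_text = " ".join(f"w{i}" for i in range(10))
demo_chunks = split_into_chunks(demo_text, max_words=4, overlap=1)
print(demo_chunks)
```

Each chunk could then be passed to `ner(chunk)`. Entities that fall inside an overlap may be reported twice and would need deduplication.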
The fine-tuning notebook (HuggingFace_Notebook_token_classification_NER_LeNER_Br.ipynb) is on GitHub.
```
Num examples = 7828
Num Epochs = 20
Instantaneous batch size per device = 2
Total train batch size (w. parallel, distributed & accumulation) = 4
Gradient Accumulation steps = 2
Total optimization steps = 39140

Step   Training Loss  Validation Loss  Precision  Recall    F1        Accuracy
500    0.250000       0.140582         0.760833   0.770323  0.765548  0.963125
1000   0.076200       0.117882         0.829082   0.817849  0.823428  0.966569
1500   0.082400       0.150047         0.679610   0.914624  0.779795  0.957213
2000   0.047500       0.133443         0.817678   0.857419  0.837077  0.969190
2500   0.034200       0.230139         0.895672   0.845591  0.869912  0.964070
3000   0.033800       0.108022         0.859225   0.887312  0.873043  0.973700
3500   0.030100       0.113467         0.855747   0.885376  0.870310  0.975879
4000   0.029900       0.118619         0.850207   0.884946  0.867229  0.974477
4500   0.022500       0.124327         0.841048   0.890968  0.865288  0.975041
5000   0.020200       0.129294         0.801538   0.918925  0.856227  0.968077
5500   0.019700       0.128344         0.814222   0.908602  0.858827  0.969250
6000   0.024600       0.182563         0.908087   0.866882  0.887006  0.968565
6500   0.012600       0.159217         0.829883   0.913763  0.869806  0.969357
7000   0.020600       0.183726         0.854557   0.893333  0.873515  0.966447
7500   0.014400       0.141395         0.777716   0.905161  0.836613  0.966828
8000   0.013400       0.139378         0.873042   0.899140  0.885899  0.975772
8500   0.014700       0.142521         0.864152   0.901505  0.882433  0.976366
9000   0.010900       0.122889         0.897522   0.919140  0.908202  0.980831
9500   0.013500       0.143407         0.816580   0.906667  0.859268  0.973395
10000  0.010400       0.144946         0.835608   0.908387  0.870479  0.974629
10500  0.007800       0.143086         0.847587   0.910108  0.877735  0.975985
11000  0.008200       0.156379         0.873778   0.884301  0.879008  0.976321
11500  0.008200       0.133356         0.901193   0.910108  0.905628  0.980328
12000  0.006900       0.133476         0.892202   0.920215  0.905992  0.980572
12500  0.006900       0.129991         0.890159   0.904516  0.897280  0.978683
```
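The figures in the log can be cross-checked with a little arithmetic. The sketch below recomputes the total number of optimization steps from the logged hyperparameters and picks the evaluation step with the highest validation F1. The device count of 1 is an assumption inferred from the logged total batch size of 4, and the steps-per-epoch formula is a simplification of what a trainer does internally.

```python
import math

# Values copied from the training log above.
num_examples = 7828
num_epochs = 20
per_device_batch_size = 2
gradient_accumulation_steps = 2
num_devices = 1  # assumption: consistent with the logged total batch size of 4

total_batch_size = per_device_batch_size * num_devices * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_examples / total_batch_size)
total_steps = steps_per_epoch * num_epochs

# (step, validation F1) pairs copied from the log above.
f1_by_step = [
    (500, 0.765548), (1000, 0.823428), (1500, 0.779795), (2000, 0.837077),
    (2500, 0.869912), (3000, 0.873043), (3500, 0.870310), (4000, 0.867229),
    (4500, 0.865288), (5000, 0.856227), (5500, 0.858827), (6000, 0.887006),
    (6500, 0.869806), (7000, 0.873515), (7500, 0.836613), (8000, 0.885899),
    (8500, 0.882433), (9000, 0.908202), (9500, 0.859268), (10000, 0.870479),
    (10500, 0.877735), (11000, 0.879008), (11500, 0.905628), (12000, 0.905992),
    (12500, 0.897280),
]

# Select the evaluation step with the highest validation F1.
best_step, best_f1 = max(f1_by_step, key=lambda row: row[1])
epochs_at_best = best_step / steps_per_epoch

print(f"total optimization steps: {total_steps}")
print(f"best step: {best_step} (F1 {best_f1}), about {epochs_at_best:.1f} epochs in")
```

The log ends at step 12500, well short of the 39140 planned steps, and the best F1 falls at step 9000, whose precision, recall, F1 and accuracy match the overall validation metrics reported for the model.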
Validation metrics by Named Entity:

```python
{'JURISPRUDENCIA': {'f1': 0.8135593220338984,
                    'number': 657,
                    'precision': 0.865979381443299,
                    'recall': 0.7671232876712328},
 'LEGISLACAO': {'f1': 0.8888888888888888,
                'number': 571,
                'precision': 0.8952042628774423,
                'recall': 0.882661996497373},
 'LOCAL': {'f1': 0.850467289719626,
           'number': 194,
           'precision': 0.7777777777777778,
           'recall': 0.9381443298969072},
 'ORGANIZACAO': {'f1': 0.8740635033892258,
                 'number': 1340,
                 'precision': 0.8373205741626795,
                 'recall': 0.914179104477612},
 'PESSOA': {'f1': 0.9836677554829678,
            'number': 1072,
            'precision': 0.9841269841269841,
            'recall': 0.9832089552238806},
 'TEMPO': {'f1': 0.9669669669669669,
           'number': 816,
           'precision': 0.9481743227326266,
           'recall': 0.9865196078431373},
 'overall_accuracy': 0.9808310603867311,
 'overall_f1': 0.9082022949426265,
 'overall_precision': 0.8975220495590088,
 'overall_recall': 0.9191397849462366}
```
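Each F1 value above is the harmonic mean of the corresponding precision and recall, F1 = 2PR / (P + R). As a quick consistency check, the sketch below recomputes F1 for every entity type and for the overall scores from the reported precision and recall. The dictionary layout and the helper `f1_score` are illustrative choices, not part of the original evaluation code.

```python
# (precision, recall, reported F1) copied from the validation metrics above.
reported = {
    "JURISPRUDENCIA": (0.865979381443299, 0.7671232876712328, 0.8135593220338984),
    "LEGISLACAO": (0.8952042628774423, 0.882661996497373, 0.8888888888888888),
    "LOCAL": (0.7777777777777778, 0.9381443298969072, 0.850467289719626),
    "ORGANIZACAO": (0.8373205741626795, 0.914179104477612, 0.8740635033892258),
    "PESSOA": (0.9841269841269841, 0.9832089552238806, 0.9836677554829678),
    "TEMPO": (0.9481743227326266, 0.9865196078431373, 0.9669669669669669),
    "overall": (0.8975220495590088, 0.9191397849462366, 0.9082022949426265),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Recompute F1 and measure the largest deviation from the reported values.
recomputed = {name: f1_score(p, r) for name, (p, r, _) in reported.items()}
max_abs_diff = max(abs(recomputed[name] - reported[name][2]) for name in reported)

for name, value in recomputed.items():
    print(f"{name:15s} recomputed F1 = {value:.6f}")
print(f"largest difference from reported F1: {max_abs_diff:.2e}")
```

The same numbers also show where the model is weakest: JURISPRUDENCIA has the lowest recall and LOCAL the lowest precision among the six entity types.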