ner-bert-base-portuguese-cased-lenerbr is a named entity recognition (token classification) model fine-tuned for the Portuguese legal domain. It was fine-tuned on December 20, 2021 in Google Colab with a NER objective from the model pierreguillou/bert-base-cased-pt-lenerbr on the dataset LeNER_br.
Due to the small size of BERTimbau base and of the fine-tuning dataset, the model overfitted before reaching the end of training. The overall final metrics on the validation dataset are reported below (note: see the paragraph "Validation metrics by Named Entity" for detailed metrics):
See as well the large version of this model, which reaches an f1 of 0.908.
Note: the model pierreguillou/bert-base-cased-pt-lenerbr is a language model that was created by fine-tuning the model BERTimbau base with a MASK objective on the dataset LeNER-Br language modeling. This first specialization of the language model before fine-tuning on the NER task slightly improved the model quality. To prove it, here are the results of a NER model fine-tuned from the non-specialized language model BERTimbau base:
NLP | Modelos e Web App para Reconhecimento de Entidade Nomeada (NER) no domínio jurídico brasileiro (December 29, 2021)
You can test this model in the widget on this page.
You can also test it with the NER App, which allows comparing the 2 BERT models (base and large) fine-tuned on the legal LeNER-Br dataset on the NER task.
# install pytorch: check https://pytorch.org/
# !pip install transformers
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# parameters
model_name = "pierreguillou/ner-bert-base-cased-pt-lenerbr"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "Acrescento que não há de se falar em violação do artigo 114, § 3º, da Constituição Federal, posto que referido dispositivo revela-se impertinente, tratando da possibilidade de ajuizamento de dissídio coletivo pelo Ministério Público do Trabalho nos casos de greve em atividade essencial."

# tokenization
inputs = tokenizer(input_text, max_length=512, truncation=True, return_tensors="pt")
tokens = inputs.tokens()

# get predictions
outputs = model(**inputs).logits
predictions = torch.argmax(outputs, dim=2)

# print predictions
for token, prediction in zip(tokens, predictions[0].numpy()):
    print((token, model.config.id2label[prediction]))
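The loop above prints one (wordpiece, tag) pair per token. If whole entity spans are needed without relying on the pipeline, they can be assembled from the predicted tags. Below is a minimal sketch that reuses tokens, predictions, model and tokenizer from the snippet above, assuming the labels follow the usual BIO scheme (e.g. B-PESSOA / I-PESSOA):

# Minimal sketch (assumption: labels use the BIO scheme, e.g. B-PESSOA / I-PESSOA).
# Reuses `tokens`, `predictions`, `model` and `tokenizer` from the snippet above.
entities = []
current = None
for token, prediction in zip(tokens, predictions[0].numpy()):
    if token in tokenizer.all_special_tokens:  # skip [CLS], [SEP], [PAD]
        continue
    label = model.config.id2label[prediction]
    piece = token[2:] if token.startswith("##") else " " + token  # merge wordpieces
    if label.startswith("B-"):
        current = [label[2:], piece.strip()]
        entities.append(current)
    elif label.startswith("I-") and current is not None and current[0] == label[2:]:
        current[1] += piece
    else:
        current = None

for entity_type, entity_text in entities:
    print(entity_type, "->", entity_text)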
You can use a pipeline as well. However, it seems to have an issue with the max_length of the input sequence.
!pip install transformers
import transformers
from transformers import pipeline

model_name = "pierreguillou/ner-bert-base-cased-pt-lenerbr"

ner = pipeline(
    "ner",
    model=model_name
)

ner(input_text)
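If the max_length issue shows up on long documents, one possible workaround (an assumption, not part of the original model card) is to truncate the text to the model's 512-token limit before calling the pipeline, optionally with aggregation_strategy="simple" to group wordpieces into entities:

# Hedged workaround sketch: truncate long inputs to the 512-token limit before the pipeline.
# The helper name `ner_truncated` is hypothetical; `input_text` is the sentence from the first snippet.
from transformers import AutoTokenizer, pipeline

model_name = "pierreguillou/ner-bert-base-cased-pt-lenerbr"
tokenizer = AutoTokenizer.from_pretrained(model_name)
ner = pipeline("ner", model=model_name, aggregation_strategy="simple")

def ner_truncated(text, max_length=512):
    # encode with truncation, then decode back to plain text for the pipeline
    ids = tokenizer(text, max_length=max_length, truncation=True, add_special_tokens=False)["input_ids"]
    return ner(tokenizer.decode(ids))

print(ner_truncated(input_text))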
The fine-tuning notebook (HuggingFace_Notebook_token_classification_NER_LeNER_Br.ipynb) is on GitHub.
Num examples = 7828
Num Epochs = 10
Instantaneous batch size per device = 2
Total train batch size (w. parallel, distributed & accumulation) = 4
Gradient Accumulation steps = 2
Total optimization steps = 19570

 Step  Training Loss  Validation Loss  Precision  Recall    F1        Accuracy
  300  0.127600  0.178613  0.722909  0.741720  0.732194  0.948802
  600  0.088200  0.136965  0.733636  0.867742  0.795074  0.963079
  900  0.078000  0.128858  0.791912  0.838065  0.814335  0.965243
 1200  0.077800  0.126345  0.815400  0.865376  0.839645  0.967849
 1500  0.074100  0.148207  0.779274  0.895914  0.833533  0.960184
 1800  0.059500  0.116634  0.830829  0.868172  0.849090  0.969342
 2100  0.044500  0.208459  0.887150  0.816559  0.850392  0.960535
 2400  0.029400  0.136352  0.867821  0.851398  0.859531  0.970271
 2700  0.025000  0.165837  0.814881  0.878495  0.845493  0.961235
 3000  0.038400  0.120629  0.811719  0.893763  0.850768  0.971506
 3300  0.026200  0.175094  0.823435  0.882581  0.851983  0.962957
 3600  0.025600  0.178438  0.881095  0.886022  0.883551  0.963689
 3900  0.041000  0.134648  0.789035  0.916129  0.847846  0.967681
 4200  0.026700  0.130178  0.821275  0.903226  0.860303  0.972313
 4500  0.018500  0.139294  0.844016  0.875054  0.859255  0.971140
 4800  0.020800  0.197811  0.892504  0.873118  0.882705  0.965883
 5100  0.019300  0.161239  0.848746  0.888172  0.868012  0.967849
 5400  0.024000  0.139131  0.837507  0.913333  0.873778  0.970591
 5700  0.018400  0.157223  0.899754  0.864731  0.881895  0.970210
 6000  0.023500  0.137022  0.883018  0.873333  0.878149  0.973243
 6300  0.009300  0.181448  0.840490  0.900860  0.869628  0.968290
 6600  0.019200  0.173125  0.821316  0.896559  0.857290  0.966736
 6900  0.016100  0.143160  0.789938  0.904946  0.843540  0.968245
 7200  0.017000  0.145755  0.823274  0.897634  0.858848  0.969037
 7500  0.012100  0.159342  0.825694  0.883226  0.853491  0.967468
 7800  0.013800  0.194886  0.861237  0.859570  0.860403  0.964771
 8100  0.008000  0.140271  0.829914  0.896129  0.861752  0.971567
 8400  0.010300  0.143318  0.826844  0.908817  0.865895  0.973578
 8700  0.015000  0.143392  0.847336  0.889247  0.867786  0.973365
 9000  0.006000  0.143512  0.847795  0.905591  0.875741  0.972892
 9300  0.011800  0.138747  0.827133  0.894194  0.859357  0.971673
 9600  0.008500  0.159490  0.837030  0.909032  0.871546  0.970028
 9900  0.010700  0.159249  0.846692  0.910968  0.877655  0.970546
10200  0.008100  0.170069  0.848288  0.900645  0.873683  0.969113
10500  0.004800  0.183795  0.860317  0.899355  0.879403  0.969570
10800  0.010700  0.157024  0.837838  0.906667  0.870894  0.971094
11100  0.003800  0.164286  0.845312  0.880215  0.862410  0.970744
11400  0.009700  0.204025  0.884294  0.887527  0.885907  0.968854
11700  0.008900  0.162819  0.829415  0.887742  0.857588  0.970530
12000  0.006400  0.164296  0.852666  0.901075  0.876202  0.971414
12300  0.007100  0.143367  0.852959  0.895699  0.873807  0.973669
12600  0.015800  0.153383  0.859224  0.900430  0.879345  0.972679
12900  0.006600  0.173447  0.869954  0.899140  0.884306  0.970927
13200  0.006800  0.163234  0.856849  0.897204  0.876563  0.971795
13500  0.003200  0.167164  0.850867  0.907957  0.878485  0.971231
13800  0.003600  0.148950  0.867801  0.910538  0.888656  0.976961
14100  0.003500  0.155691  0.847621  0.907957  0.876752  0.974127
14400  0.003300  0.157672  0.846553  0.911183  0.877680  0.974584
14700  0.002500  0.169965  0.847804  0.917634  0.881338  0.973045
15000  0.003400  0.177099  0.842199  0.912473  0.875929  0.971155
15300  0.006000  0.164151  0.848928  0.911183  0.878954  0.973258
15600  0.002400  0.174305  0.847437  0.906667  0.876052  0.971765
15900  0.004100  0.174561  0.852929  0.907957  0.879583  0.972907
16200  0.002600  0.172626  0.843263  0.907097  0.874016  0.972100
16500  0.002100  0.185302  0.841108  0.907312  0.872957  0.970485
16800  0.002900  0.175638  0.840557  0.909247  0.873554  0.971704
17100  0.001600  0.178750  0.857056  0.906452  0.881062  0.971765
17400  0.003900  0.188910  0.853619  0.907957  0.879950  0.970835
17700  0.002700  0.180822  0.864699  0.907097  0.885390  0.972283
18000  0.001300  0.179974  0.868150  0.906237  0.886785  0.973060
18300  0.000800  0.188032  0.881022  0.904516  0.892615  0.972572
18600  0.002700  0.183266  0.868601  0.901290  0.884644  0.972298
18900  0.001600  0.180301  0.862041  0.903011  0.882050  0.972344
19200  0.002300  0.183432  0.855370  0.904301  0.879155  0.971109
19500  0.001800  0.183381  0.854501  0.904301  0.878696  0.971186
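The hyperparameters visible in this log (10 epochs, per-device batch size 2, gradient accumulation 2, evaluation every 300 steps) correspond to a standard Hugging Face Trainer run. Below is a hypothetical sketch of matching TrainingArguments; only the values listed above come from the log, the remaining arguments are assumptions, not taken from the notebook:

# Hypothetical TrainingArguments reflecting the log above (all other values are assumptions).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="ner-bert-base-cased-pt-lenerbr",  # assumed output directory
    num_train_epochs=10,                          # Num Epochs = 10
    per_device_train_batch_size=2,                # Instantaneous batch size per device = 2
    gradient_accumulation_steps=2,                # Gradient Accumulation steps = 2
    evaluation_strategy="steps",                  # metrics are logged every 300 steps
    eval_steps=300,
    logging_steps=300,
    load_best_model_at_end=True,                  # assumption: keep the best checkpoint
    metric_for_best_model="f1",
)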
Num examples = 1177

{'JURISPRUDENCIA': {'f1': 0.7016574585635359,
  'number': 657,
  'precision': 0.6422250316055625,
  'recall': 0.7732115677321156},
 'LEGISLACAO': {'f1': 0.8839681133746677,
  'number': 571,
  'precision': 0.8942652329749103,
  'recall': 0.8739054290718039},
 'LOCAL': {'f1': 0.8253968253968254,
  'number': 194,
  'precision': 0.7368421052631579,
  'recall': 0.9381443298969072},
 'ORGANIZACAO': {'f1': 0.8934049079754601,
  'number': 1340,
  'precision': 0.918769716088328,
  'recall': 0.8694029850746269},
 'PESSOA': {'f1': 0.982653539615565,
  'number': 1072,
  'precision': 0.9877474081055608,
  'recall': 0.9776119402985075},
 'TEMPO': {'f1': 0.9657657657657657,
  'number': 816,
  'precision': 0.9469964664310954,
  'recall': 0.9852941176470589},
 'overall_accuracy': 0.9725722644643211,
 'overall_f1': 0.8926146010186757,
 'overall_precision': 0.8810222036028488,
 'overall_recall': 0.9045161290322581}
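The dictionary above has the format produced by the seqeval metric used in the Hugging Face token-classification notebooks; this is an assumption based on the output format, not a statement from the original card. A minimal, illustrative sketch of how such per-entity metrics are computed:

# Minimal sketch: per-entity precision/recall/F1 in the same format as above,
# computed with the seqeval metric (toy labels, not the real LeNER-Br predictions).
# Requires: pip install datasets seqeval
from datasets import load_metric

metric = load_metric("seqeval")
references  = [["B-PESSOA", "I-PESSOA", "O", "B-TEMPO", "O"]]
predictions = [["B-PESSOA", "I-PESSOA", "O", "O", "O"]]
print(metric.compute(predictions=predictions, references=references))
# -> {'PESSOA': {'precision': ..., 'recall': ..., 'f1': ..., 'number': ...}, 'TEMPO': {...},
#     'overall_precision': ..., 'overall_recall': ..., 'overall_f1': ..., 'overall_accuracy': ...}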