
(BERT base) NER model for the Portuguese legal domain

README under construction

ner-legal-bert-base-cased-ptbr is a NER (token classification) model for the Portuguese legal domain. It was fine-tuned from the model dominguesm/legal-bert-base-cased-ptbr using a NER objective.

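The model is intended to support NLP research in the legal field, computational law, and legal technology applications. It was trained on several Portuguese legal texts (details below) using the following labels: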

  • PESSOA
  • ORGANIZACAO
  • LOCAL
  • TEMPO
  • LEGISLACAO
  • JURISPRUDENCIA

The labels were inspired by the LeNER_br dataset.

Training dataset

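The dataset used for ner-legal-bert-base-cased-ptbr includes:

  • 971932 examples of miscellaneous legal documents (train split)
  • 53996 examples of miscellaneous legal documents (validation split)
  • 53997 examples of miscellaneous legal documents (test split)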
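As a quick sanity check on the split sizes listed above, the short sketch below computes each split's share of the total. It uses only the three counts given in this README; the variable names are illustrative.

```python
# Split sizes as listed above.
splits = {"train": 971932, "validation": 53996, "test": 53997}

total = sum(splits.values())
fractions = {name: size / total for name, size in splits.items()}

# Print each split with its share of all examples.
for name, fraction in fractions.items():
    print(f"{name}: {splits[name]} examples ({fraction:.1%})")
```

The counts sum to 1079925 examples, which corresponds to a split of roughly 90% train, 5% validation, and 5% test.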

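The data was provided by the Brazilian Supreme Federal Court under the following terms of use: LREC 2020

The results of this project do not in any way imply the position of the Brazilian Supreme Federal Court; all responsibility lies with the author of the model.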

Using the model for inference in production

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# parameters
model_name = "dominguesm/ner-legal-bert-base-cased-ptbr"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "Acrescento que não há de se falar em violação do artigo 114, § 3º, da Constituição Federal, posto que referido dispositivo revela-se impertinente, tratando da possibilidade de ajuizamento de dissídio coletivo pelo Ministério Público do Trabalho nos casos de greve em atividade essencial."

# tokenization
inputs = tokenizer(input_text, max_length=512, truncation=True, return_tensors="pt")
tokens = inputs.tokens()

# get predictions (gradients are not needed for inference)
with torch.no_grad():
    logits = model(**inputs).logits
predictions = torch.argmax(logits, dim=2)

# print each token with its predicted label
for token, prediction in zip(tokens, predictions[0].tolist()):
    print((token, model.config.id2label[prediction]))
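The loop above prints one label per WordPiece token. To get whole entities, the token-level output can be merged into spans. The following is a minimal sketch, not part of the model's API: it assumes the labels follow a BIO scheme (for example `B-PESSOA` / `I-PESSOA` / `O`), that `##` marks a WordPiece continuation, and that joining words with spaces is acceptable. The function name `group_entities` and the toy tokens and labels are hypothetical and are not actual model output.

```python
from typing import List, Optional, Tuple

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}

def group_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge WordPiece tokens and BIO labels into (text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the entity being built, if any.
        nonlocal words, current_type
        if words:
            entities.append((" ".join(words), current_type))
        words, current_type = [], None

    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue
        if token.startswith("##"):
            # Continuation piece: glue it to the previous word of the open entity.
            if words:
                words[-1] += token[2:]
            continue
        if label == "O":
            flush()
            continue
        prefix, sep, entity_type = label.partition("-")
        if not sep:
            # Label without a B-/I- prefix: treat it as a continuation tag.
            prefix, entity_type = "I", label
        if prefix == "B" or entity_type != current_type:
            flush()
            current_type = entity_type
        words.append(token)

    flush()
    return entities

# Hypothetical tokens and labels, for illustration only.
toy_tokens = ["[CLS]", "artigo", "114", "da", "Constituição", "Federal", ",",
              "Min", "##istério", "Público", "[SEP]"]
toy_labels = ["O", "B-LEGISLACAO", "I-LEGISLACAO", "O", "B-LEGISLACAO", "I-LEGISLACAO", "O",
              "B-ORGANIZACAO", "I-ORGANIZACAO", "I-ORGANIZACAO", "O"]
print(group_entities(toy_tokens, toy_labels))
```

With the real model, `tokens` and the labels looked up through `model.config.id2label` would take the place of the toy lists.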

You can also use the pipeline API. However, there appears to be an issue with the maximum length of the input sequence.

from transformers import pipeline

model_name = "dominguesm/ner-legal-bert-base-cased-ptbr"

ner = pipeline(
    "ner",
    model=model_name
) 

ner(input_text, aggregation_strategy="average")
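One possible workaround for the maximum-length issue mentioned above is to split long inputs into overlapping windows before inference and run the model on each window. The sketch below shows only the windowing step on a plain list of token ids. It is an assumption about how one might handle long texts, not something this README prescribes; the function name `sliding_windows` and the default values (510 tokens per window, leaving room for two special tokens within the 512 limit, and an overlap of 128) are illustrative.

```python
from typing import List

def sliding_windows(token_ids: List[int], max_len: int = 510, stride: int = 128) -> List[List[int]]:
    """Split token_ids into windows of at most max_len tokens.

    Consecutive windows overlap by `stride` tokens so that an entity cut at
    one window boundary appears whole in the neighbouring window.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
        start += step
    return windows

# Example with 1000 dummy token ids.
windows = sliding_windows(list(range(1000)))
print([len(w) for w in windows])
```

Predictions for the overlapping regions then need to be reconciled, for example by keeping the prediction from the window in which the token is farthest from a boundary.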

Training procedure

Hyperparameters

Batch size, learning rate, and optimizer
  • per_device_batch_size = 64
  • gradient_accumulation_steps = 2
  • learning_rate = 2e-5
  • num_train_epochs = 3
  • weight_decay = 0.01
  • optimizer = torch.optim.AdamW
  • epsilon = 1e-08
  • lr_scheduler_type = linear
Saving checkpoints and loading the best model
  • save_total_limit = 3
  • logging_steps = 1000
  • eval_steps = logging_steps
  • evaluation_strategy = 'steps'
  • logging_strategy = 'steps'
  • save_strategy = 'steps'
  • save_steps = logging_steps
  • load_best_model_at_end = True
  • fp16 = True
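To illustrate what `lr_scheduler_type = linear` means for the learning rate listed above, here is a small sketch of a linear decay from 2e-5 to zero. It assumes no warmup steps (the README does not state a warmup value) and takes the total of 22779 optimization steps from the training log in the next section. The function `linear_lr` is a hypothetical helper, not a library call.

```python
def linear_lr(step: int, base_lr: float = 2e-5, total_steps: int = 22779,
              warmup_steps: int = 0) -> float:
    """Learning rate at `step` under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr (unused when warmup_steps == 0).
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Learning rate at a few points during training.
for step in (0, 5000, 10000, 20000, 22779):
    print(step, linear_lr(step))
```

Under these assumptions the rate starts at 2e-5 and reaches zero at the final optimization step.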

Training results

Num examples = 971932
Num Epochs = 3
Instantaneous batch size per device = 64
Total train batch size (w. parallel, distributed & accumulation) = 128
Gradient Accumulation steps = 2
Total optimization steps = 22779
Evaluation Infos:
  Num examples = 53996
  Batch size = 128
Step    Training Loss   Validation Loss   Precision   Recall     F1
1000    0.113900        0.057008          0.898600    0.938444   0.918090
2000    0.052800        0.048254          0.917243    0.941188   0.929062
3000    0.046200        0.043833          0.919706    0.948411   0.933838
4000    0.043500        0.039796          0.928439    0.947058   0.937656
5000    0.041400        0.039421          0.926103    0.952857   0.939290
6000    0.039700        0.038599          0.922376    0.956257   0.939011
7000    0.037800        0.036463          0.935125    0.950937   0.942964
8000    0.035900        0.035706          0.934638    0.954147   0.944292
9000    0.033800        0.034518          0.940354    0.951991   0.946136
10000   0.033600        0.033454          0.938170    0.956097   0.947049
11000   0.032700        0.032899          0.934130    0.959491   0.946641
12000   0.032200        0.032477          0.937400    0.959150   0.948151
13000   0.031200        0.033207          0.937058    0.960506   0.948637
14000   0.031400        0.031711          0.938765    0.959711   0.949123
15000   0.030600        0.031519          0.940488    0.959413   0.949856
16000   0.028500        0.031618          0.943643    0.957693   0.950616
17000   0.028000        0.031106          0.941109    0.960687   0.950797
18000   0.027800        0.030712          0.942821    0.960528   0.951592
19000   0.027500        0.030523          0.942950    0.960947   0.951864
20000   0.027400        0.030577          0.942462    0.961754   0.952010
21000   0.027000        0.030025          0.944483    0.960497   0.952422
22000   0.026800        0.030162          0.943868    0.961418   0.952562
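The batch size and step count reported in the training log above can be reproduced from the hyperparameters. The sketch below assumes a single device, that the last partial batch of each epoch is kept, and that an incomplete gradient-accumulation group at the end of an epoch is not counted as an extra update. These assumptions are chosen because they match the logged numbers; the helper names are illustrative.

```python
import math

def effective_batch_size(per_device_batch_size: int, grad_accum_steps: int,
                         num_devices: int = 1) -> int:
    """Number of examples contributing to one optimizer update."""
    return per_device_batch_size * grad_accum_steps * num_devices

def total_optimization_steps(num_examples: int, per_device_batch_size: int,
                             grad_accum_steps: int, num_epochs: int,
                             num_devices: int = 1) -> int:
    """Optimizer updates over the whole run under the assumptions stated above."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * num_epochs

batch = effective_batch_size(64, 2)
steps = total_optimization_steps(971932, 64, 2, 3)
print(batch)  # 128, as in the log
print(steps)  # 22779, as in the log
```

With evaluation every 1000 steps, the last evaluation row in the table falls at step 22000, shortly before the final step.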

Validation metrics by named entity (test dataset)

  • Num examples = 53997
  • overall_precision:0.9432396865925381
  • overall_recall:0.9614334116769161
  • overall_f1:0.9522496545298874
  • overall_accuracy:0.9894741602608071
Label            Precision            Recall               F1                   Entity Examples
JURISPRUDENCIA   0.8795197115548148   0.9037275221501844   0.8914593047810311   57223
LEGISLACAO       0.9405395935529082   0.9514071028567378   0.9459421362370934   84642
LOCAL            0.9011495452253004   0.9132358124779697   0.9071524233856495   56740
ORGANIZACAO      0.9239028155165304   0.954964947845235    0.9391771163875446   183013
PESSOA           0.9651685220572037   0.9738545198908279   0.9694920661875761   193456
TEMPO            0.973704616066295    0.9918808401799004   0.9827086882453152   186103
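The F1 values above are the harmonic mean of precision and recall. As a small worked check, the sketch below recomputes the overall F1 from the overall precision and recall reported for the test dataset. The function name `f1_score` is an illustrative helper defined here, not an import from an evaluation library.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Overall test-set precision and recall as reported above.
overall_f1 = f1_score(0.9432396865925381, 0.9614334116769161)
print(f"{overall_f1:.4f}")  # 0.9522
```

The same function can be applied to each row of the per-label table to confirm that the F1 column is consistent with its precision and recall.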

  • This README is based on Pierre Guillou's README (available here), and some parts were copied in full.