Model:
elenanereiss/bert-german-ler
This model is a fine-tuned version of bert-base-german-cased, trained on the German LER Dataset.
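As a minimal sketch of how the model might be used for inference, the snippet below loads it through the Hugging Face `transformers` token-classification pipeline. The helper names `load_ler_pipeline`, `summarize_entities`, and `main`, the example sentence, and the choice of `aggregation_strategy="simple"` are illustrative assumptions, not part of the original model card.

```python
MODEL_ID = "elenanereiss/bert-german-ler"


def load_ler_pipeline(model_id: str = MODEL_ID):
    """Build a token-classification pipeline (downloads the model on first use)."""
    # Imported lazily so the helpers below work without transformers installed.
    from transformers import pipeline

    # "simple" aggregation merges word pieces into whole entity spans.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


def summarize_entities(entities):
    """Turn aggregated pipeline output into (text, label, score) tuples."""
    return [
        (e["word"], e["entity_group"], round(float(e["score"]), 2))
        for e in entities
    ]


def main():
    # Hypothetical example sentence; any German legal text can be used.
    text = "Das Bundesverfassungsgericht hat seinen Sitz in Karlsruhe."
    ner = load_ler_pipeline()
    for word, label, score in summarize_entities(ner(text)):
        print(f"{label}\t{score}\t{word}")
```

Calling `main()` requires `transformers` and a backend such as PyTorch to be installed, plus network access for the first download. The labels printed are the fine-grained class tags listed in the table that follows.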
Distribution of classes in the dataset:
No. | Tag | Fine-grained class | Count | % |
---|---|---|---|---|
1 | PER | Person | 1,747 | 3.26 |
2 | RR | Judge | 1,519 | 2.83 |
3 | AN | Lawyer | 111 | 0.21 |
4 | LD | Country | 1,429 | 2.66 |
5 | ST | City | 705 | 1.31 |
6 | STR | Street | 136 | 0.25 |
7 | LDS | Landscape | 198 | 0.37 |
8 | ORG | Organization | 1,166 | 2.17 |
9 | UN | Company | 1,058 | 1.97 |
10 | INN | Institution | 2,196 | 4.09 |
11 | GRT | Court | 3,212 | 5.99 |
12 | MRK | Brand | 283 | 0.53 |
13 | GS | Law | 18,520 | 34.53 |
14 | VO | Ordinance | 797 | 1.49 |
15 | EUN | European legal norm | 1,499 | 2.79 |
16 | VS | Regulation | 607 | 1.13 |
17 | VT | Contract | 2,863 | 5.34 |
18 | RS | Court decision | 12,580 | 23.46 |
19 | LIT | Legal literature | 3,006 | 5.60 |
Total | | | 53,632 | 100 |
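The percentage column can be re-derived from the counts. The short sketch below does that arithmetic, with the GS and RS counts taken as 18,520 and 12,580, the values consistent with the stated total of 53,632 and with the listed percentages. The variable names are illustrative.

```python
# Entity counts per fine-grained class, copied from the table above.
# GS and RS are taken as 18,520 and 12,580 (assumption consistent with
# the stated total of 53,632 and the listed percentages).
counts = {
    "PER": 1747, "RR": 1519, "AN": 111, "LD": 1429, "ST": 705,
    "STR": 136, "LDS": 198, "ORG": 1166, "UN": 1058, "INN": 2196,
    "GRT": 3212, "MRK": 283, "GS": 18520, "VO": 797, "EUN": 1499,
    "VS": 607, "VT": 2863, "RS": 12580, "LIT": 3006,
}

total = sum(counts.values())

# Share of each class in percent, rounded to two decimals like the table.
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}

# Print classes from most to least frequent.
for label, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    print(f"{label:<4}{n:>7}{shares[label]:>7.2f}")
print(f"{'Total':<4}{total:>6}")
```

The output makes the class imbalance easy to see: GS and RS together account for well over half of all annotated entities, while AN, STR, and LDS each fall below half a percent.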
To fine-tune another model on the German LER Dataset, see GitHub.
The following hyperparameters were used during training:
Class | Precision | Recall | F1-score | Support |
---|---|---|---|---|
AN | 0.75 | 0.50 | 0.60 | 12 |
EUN | 0.92 | 0.93 | 0.92 | 116 |
GRT | 0.95 | 0.99 | 0.97 | 331 |
GS | 0.98 | 0.98 | 0.98 | 1720 |
INN | 0.84 | 0.91 | 0.88 | 199 |
LD | 0.95 | 0.95 | 0.95 | 109 |
LDS | 0.82 | 0.43 | 0.56 | 21 |
LIT | 0.88 | 0.92 | 0.90 | 231 |
MRK | 0.50 | 0.70 | 0.58 | 23 |
ORG | 0.64 | 0.71 | 0.67 | 103 |
PER | 0.86 | 0.93 | 0.90 | 186 |
RR | 0.97 | 0.98 | 0.97 | 144 |
RS | 0.94 | 0.95 | 0.94 | 1126 |
ST | 0.91 | 0.88 | 0.89 | 58 |
STR | 0.29 | 0.29 | 0.29 | 7 |
UN | 0.81 | 0.85 | 0.83 | 143 |
VO | 0.76 | 0.95 | 0.84 | 37 |
VS | 0.62 | 0.80 | 0.70 | 56 |
VT | 0.87 | 0.92 | 0.90 | 275 |
micro avg | 0.92 | 0.94 | 0.93 | 4897 |
macro avg | 0.80 | 0.82 | 0.80 | 4897 |
weighted avg | 0.92 | 0.94 | 0.93 | 4897 |

Class | Precision | Recall | F1-score | Support |
---|---|---|---|---|
AN | 1.00 | 0.89 | 0.94 | 9 |
EUN | 0.90 | 0.97 | 0.93 | 150 |
GRT | 0.98 | 0.98 | 0.98 | 321 |
GS | 0.98 | 0.99 | 0.98 | 1818 |
INN | 0.90 | 0.95 | 0.92 | 222 |
LD | 0.97 | 0.92 | 0.94 | 149 |
LDS | 0.91 | 0.45 | 0.61 | 22 |
LIT | 0.92 | 0.96 | 0.94 | 314 |
MRK | 0.78 | 0.88 | 0.82 | 32 |
ORG | 0.82 | 0.88 | 0.85 | 113 |
PER | 0.92 | 0.88 | 0.90 | 173 |
RR | 0.95 | 0.99 | 0.97 | 142 |
RS | 0.97 | 0.98 | 0.97 | 1245 |
ST | 0.79 | 0.86 | 0.82 | 64 |
STR | 0.75 | 0.80 | 0.77 | 15 |
UN | 0.90 | 0.95 | 0.93 | 108 |
VO | 0.80 | 0.83 | 0.81 | 71 |
VS | 0.73 | 0.84 | 0.78 | 64 |
VT | 0.93 | 0.97 | 0.95 | 290 |
micro avg | 0.94 | 0.96 | 0.95 | 5322 |
macro avg | 0.89 | 0.89 | 0.89 | 5322 |
weighted avg | 0.95 | 0.96 | 0.95 | 5322 |
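To show how the macro and weighted averages relate to the per-class scores, the sketch below recomputes both F1 averages from the report above whose total support is 5322. The variable names are illustrative, and because the per-class inputs are already rounded to two decimals, the recomputed values can differ from the reported ones in the last digit. The micro average cannot be recovered from F1 scores alone, since it needs the underlying true-positive, false-positive, and false-negative counts.

```python
# (label, f1, support) rows copied from the report with total support 5322.
rows = [
    ("AN", 0.94, 9), ("EUN", 0.93, 150), ("GRT", 0.98, 321),
    ("GS", 0.98, 1818), ("INN", 0.92, 222), ("LD", 0.94, 149),
    ("LDS", 0.61, 22), ("LIT", 0.94, 314), ("MRK", 0.82, 32),
    ("ORG", 0.85, 113), ("PER", 0.90, 173), ("RR", 0.97, 142),
    ("RS", 0.97, 1245), ("ST", 0.82, 64), ("STR", 0.77, 15),
    ("UN", 0.93, 108), ("VO", 0.81, 71), ("VS", 0.78, 64),
    ("VT", 0.95, 290),
]

# Macro average: every class counts equally, regardless of frequency.
macro_f1 = sum(f1 for _, f1, _ in rows) / len(rows)

# Weighted average: each class is weighted by its support.
total_support = sum(support for _, _, support in rows)
weighted_f1 = sum(f1 * support for _, f1, support in rows) / total_support

print(f"classes:       {len(rows)}")
print(f"total support: {total_support}")
print(f"macro F1:      {macro_f1:.3f}")
print(f"weighted F1:   {weighted_f1:.3f}")
```

The gap between the two averages reflects the class imbalance: rare classes such as LDS and STR pull the macro average down, while the weighted average is dominated by the frequent GS and RS classes.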
@misc{https://doi.org/10.48550/arxiv.2003.13016,
  doi = {10.48550/ARXIV.2003.13016},
  url = {https://arxiv.org/abs/2003.13016},
  author = {Leitner, Elena and Rehm, Georg and Moreno-Schneider, Julián},
  keywords = {Computation and Language (cs.CL), Information Retrieval (cs.IR), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {A Dataset of German Legal Documents for Named Entity Recognition},
  publisher = {arXiv},
  year = {2020},
  copyright = {arXiv.org perpetual, non-exclusive license}
}