camembert-ner: a model fine-tuned from camemBERT for the NER task.

Introduction

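[camembert-ner] is an NER model fine-tuned from camemBERT on the WikiNER dataset. It was trained on the French portion of WikiNER (about 170,634 sentences). The model was validated on email and chat data, where it performs well. In particular, it is notably better at recognizing entities that do not start with an uppercase letter.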

Training data

The training data is labeled with the following classes:

Abbreviation	Description
O	Outside of a named entity
MISC	Miscellaneous entity
PER	Person’s name
ORG	Organization
LOC	Location

How to load camembert-ner and its sub-word tokenizer with HuggingFace:

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Download (or load from the local cache) the tokenizer and the fine-tuned model.
tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/camembert-ner")
model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/camembert-ner")


# Process a text sample (from Wikipedia)

from transformers import pipeline

# aggregation_strategy="simple" merges sub-word tokens into whole entities.
nlp = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")
nlp("Apple est créée le 1er avril 1976 dans le garage de la maison d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak et Ronald Wayne14, puis constituée sous forme de société le 3 janvier 1977 à l'origine sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification de ses produits, le mot « computer » est retiré le 9 janvier 2015.")


[{'entity_group': 'ORG',
  'score': 0.9472818374633789,
  'word': 'Apple',
  'start': 0,
  'end': 5},
 {'entity_group': 'PER',
  'score': 0.9838564991950989,
  'word': 'Steve Jobs',
  'start': 74,
  'end': 85},
 {'entity_group': 'LOC',
  'score': 0.9831605950991312,
  'word': 'Los Altos',
  'start': 87,
  'end': 97},
 {'entity_group': 'LOC',
  'score': 0.9834540486335754,
  'word': 'Californie',
  'start': 100,
  'end': 111},
 {'entity_group': 'PER',
  'score': 0.9841555754343668,
  'word': 'Steve Jobs',
  'start': 115,
  'end': 126},
 {'entity_group': 'PER',
  'score': 0.9843501806259155,
  'word': 'Steve Wozniak',
  'start': 127,
  'end': 141},
 {'entity_group': 'PER',
  'score': 0.9841533899307251,
  'word': 'Ronald Wayne',
  'start': 144,
  'end': 157},
 {'entity_group': 'ORG',
  'score': 0.9468960364659628,
  'word': 'Apple Computer',
  'start': 243,
  'end': 257}]
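
The output above is a plain list of dictionaries, so it can be post-processed without any extra library. Below is a minimal sketch, not part of the original model card, that groups entities by type and recovers each surface form from the character offsets. The helper name `group_entities` and the `min_score` threshold of 0.9 are hypothetical choices made for illustration. Note that in the sample output the `start` offset of non-initial entities points at the preceding space (for example, `start` 74 for "Steve Jobs"), so the sliced text is stripped.

```python
from collections import defaultdict

# Beginning of the sample sentence above, shortened for this sketch.
text = ("Apple est créée le 1er avril 1976 dans le garage de la maison "
        "d'enfance de Steve Jobs à Los Altos en Californie")

# First four entries of the pipeline output shown above (scores rounded).
entities = [
    {"entity_group": "ORG", "score": 0.9473, "word": "Apple", "start": 0, "end": 5},
    {"entity_group": "PER", "score": 0.9839, "word": "Steve Jobs", "start": 74, "end": 85},
    {"entity_group": "LOC", "score": 0.9832, "word": "Los Altos", "start": 87, "end": 97},
    {"entity_group": "LOC", "score": 0.9835, "word": "Californie", "start": 100, "end": 111},
]


def group_entities(entities, text, min_score=0.9):
    """Group entity surface forms by type, dropping low-confidence entries.

    `min_score` is an illustrative threshold, not a value recommended
    by the model author.
    """
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] < min_score:
            continue
        # Offsets may include a leading space, so strip the slice.
        surface = text[ent["start"]:ent["end"]].strip()
        grouped[ent["entity_group"]].append(surface)
    return dict(grouped)


grouped = group_entities(entities, text)
print(grouped)
# {'ORG': ['Apple'], 'PER': ['Steve Jobs'], 'LOC': ['Los Altos', 'Californie']}
```

Slicing the original text rather than reading the `word` field keeps the original casing and spacing of the input, which can matter for email or chat data.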

Model performance (metric: seqeval)

Overall

precision	recall	f1
0.8859	0.8971	0.8914

By entity

entity	precision	recall	f1
PER	0.9372	0.9598	0.9483
ORG	0.8099	0.8265	0.8181
LOC	0.8905	0.9005	0.8955
MISC	0.8175	0.8117	0.8146
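
As a quick consistency check on the tables above, F1 is the harmonic mean of precision and recall. The sketch below, which is not part of the original model card, recomputes F1 from the rounded precision and recall values; the helper name `harmonic_f1` is hypothetical. Because the inputs are already rounded to four decimals, the recomputed value can differ from the reported one in the last digit.

```python
# (precision, recall, reported F1) copied from the tables above.
reported = {
    "overall": (0.8859, 0.8971, 0.8914),
    "PER": (0.9372, 0.9598, 0.9483),
    "ORG": (0.8099, 0.8265, 0.8181),
    "LOC": (0.8905, 0.9005, 0.8955),
    "MISC": (0.8175, 0.8117, 0.8146),
}


def harmonic_f1(precision, recall):
    """Return the harmonic mean of precision and recall (0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for name, (p, r, f1) in reported.items():
    print(f"{name:8s} reported={f1:.4f} recomputed={harmonic_f1(p, r):.4f}")
```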

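For those who may be interested, here is a short article on how I used the output of this model to train an LSTM model for signature detection in emails: https://medium.com/@jean-baptiste.polle/lstm-model-for-email-signature-detection-8e990384fefa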

Author:

JB Polle

Dataset size:

1.23 GB