[roberta-large-ner-english] is an English NER model, fine-tuned from roberta-large on the conll2003 dataset. The model was validated on email/chat data and performs particularly well on that kind of text. It also appears to handle entities that do not start with a capital letter better than comparable models.
The training data uses the following labels:

| Abbreviation | Description |
|---|---|
| O | Outside of a named entity |
| MISC | Miscellaneous entity |
| PER | Person's name |
| ORG | Organization |
| LOC | Location |
To simplify, the B- and I- prefixes were removed from the original conll2003 tags. The model was trained on the concatenation of the original conll2003 train and test sets, and the "validation" set was used for validation. This gives the following dataset sizes:

| Train | Validation |
|---|---|
| 17494 | 3250 |
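The prefix removal described above can be sketched as a small tag-mapping step. This is an illustrative assumption about how the IOB tags were collapsed, not the exact preprocessing script used for this model:

```python
def collapse_iob(tag: str) -> str:
    """Collapse an IOB tag such as 'B-PER' or 'I-ORG' to its bare
    entity class ('PER', 'ORG', ...); 'O' is returned unchanged."""
    if tag.startswith(("B-", "I-")):
        return tag.split("-", 1)[1]
    return tag

# Example: a conll2003-style tag sequence before and after collapsing.
tags = ["O", "B-PER", "I-PER", "B-ORG", "O", "B-MISC"]
print([collapse_iob(t) for t in tags])
# ['O', 'PER', 'PER', 'ORG', 'O', 'MISC']
```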
Load roberta-large-ner-english and its sub-word tokenizer:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/roberta-large-ner-english")
model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/roberta-large-ner-english")


##### Process text sample (from wikipedia)

from transformers import pipeline

nlp = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")
nlp("Apple was founded in 1976 by Steve Jobs, Steve Wozniak and Ronald Wayne to develop and sell Wozniak's Apple I personal computer")
```

Output:

```python
[{'entity_group': 'ORG', 'score': 0.99381506, 'word': ' Apple', 'start': 0, 'end': 5},
 {'entity_group': 'PER', 'score': 0.99970853, 'word': ' Steve Jobs', 'start': 29, 'end': 39},
 {'entity_group': 'PER', 'score': 0.99981767, 'word': ' Steve Wozniak', 'start': 41, 'end': 54},
 {'entity_group': 'PER', 'score': 0.99956465, 'word': ' Ronald Wayne', 'start': 59, 'end': 71},
 {'entity_group': 'PER', 'score': 0.9997918, 'word': ' Wozniak', 'start': 92, 'end': 99},
 {'entity_group': 'MISC', 'score': 0.99956393, 'word': ' Apple I', 'start': 102, 'end': 109}]
```
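The pipeline returns a list of dicts with `entity_group`, `word`, `start`, and `end` keys, as shown above. A minimal post-processing sketch that groups the detected spans by entity type (the sample dicts below are abbreviated from the output shown, keeping only the keys this sketch uses):

```python
# Abbreviated pipeline results: entity_group and word only.
results = [
    {"entity_group": "ORG", "word": " Apple"},
    {"entity_group": "PER", "word": " Steve Jobs"},
    {"entity_group": "MISC", "word": " Apple I"},
]

# Group the span texts by entity type, stripping the leading space
# that the roberta tokenizer keeps on each word.
by_type = {}
for ent in results:
    by_type.setdefault(ent["entity_group"], []).append(ent["word"].strip())

print(by_type)
# {'ORG': ['Apple'], 'PER': ['Steve Jobs'], 'MISC': ['Apple I']}
```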
Model performance computed on the conll2003 validation dataset (token-level predictions):

| entity | precision | recall | f1 |
|---|---|---|---|
| PER | 0.9914 | 0.9927 | 0.9920 |
| ORG | 0.9627 | 0.9661 | 0.9644 |
| LOC | 0.9795 | 0.9862 | 0.9828 |
| MISC | 0.9292 | 0.9262 | 0.9277 |
| Overall | 0.9740 | 0.9766 | 0.9753 |
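Token-level precision, recall, and F1 like the figures above come from counting per-token matches against the gold tags. A minimal sketch of that computation for one entity class (not the exact evaluation script used for this card):

```python
def token_metrics(true_tags, pred_tags, entity):
    """Token-level precision/recall/F1 for a single entity class,
    computed over aligned gold and predicted tag sequences."""
    tp = sum(1 for t, p in zip(true_tags, pred_tags) if t == p == entity)
    fp = sum(1 for t, p in zip(true_tags, pred_tags) if p == entity and t != entity)
    fn = sum(1 for t, p in zip(true_tags, pred_tags) if t == entity and p != entity)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy example: one ORG token was mislabelled as PER.
true = ["PER", "O", "ORG", "PER", "O"]
pred = ["PER", "O", "PER", "PER", "O"]
print(token_metrics(true, pred, "PER"))
# (0.6666666666666666, 1.0, 0.8)
```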
Model performance computed on a private dataset (emails, chats, informal discussions), word-level predictions:

| entity | precision | recall | f1 |
|---|---|---|---|
| PER | 0.8823 | 0.9116 | 0.8967 |
| ORG | 0.7694 | 0.7292 | 0.7487 |
| LOC | 0.8619 | 0.7768 | 0.8171 |
By comparison, the Spacy model (en_core_web_trf-3.2.0) achieves the following on the same private dataset:

| entity | precision | recall | f1 |
|---|---|---|---|
| PER | 0.9146 | 0.8287 | 0.8695 |
| ORG | 0.7655 | 0.6437 | 0.6993 |
| LOC | 0.8727 | 0.6180 | 0.7236 |