[roberta-large-ner-english] is an English NER model, fine-tuned from roberta-large on the conll2003 dataset. The model was validated on email/chat data and performs particularly well on that kind of text. It also appears to handle entities that do not start with a capital letter better than comparable models.
The training data uses the following labels:

| Abbreviation | Description |
|---|---|
| O | Outside of a named entity |
| MISC | Miscellaneous entity |
| PER | Person's name |
| ORG | Organization |
| LOC | Location |
To simplify, the B- and I- prefixes were removed from the original conll2003 tags. The model was trained on the concatenation of the original conll2003 train and test sets, and the "validation" set was used for validation. This gives the following dataset sizes:

| Train | Validation |
|---|---|
| 17494 | 3250 |
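The prefix removal described above can be sketched as a small tag-mapping step. This is an illustrative assumption about how the IOB tags were collapsed, not the exact preprocessing script used for this model:

```python
def collapse_iob(tag: str) -> str:
    """Collapse an IOB tag such as 'B-PER' or 'I-ORG' to its bare
    entity class ('PER', 'ORG', ...); 'O' is returned unchanged."""
    if tag.startswith(("B-", "I-")):
        return tag.split("-", 1)[1]
    return tag

# Example: a conll2003-style tag sequence before and after collapsing.
tags = ["O", "B-PER", "I-PER", "B-ORG", "O", "B-MISC"]
print([collapse_iob(t) for t in tags])
# ['O', 'PER', 'PER', 'ORG', 'O', 'MISC']
```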
Load roberta-large-ner-english and its sub-word tokenizer:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/roberta-large-ner-english")
model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/roberta-large-ner-english")


##### Process text sample (from wikipedia)

from transformers import pipeline

nlp = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")
nlp("Apple was founded in 1976 by Steve Jobs, Steve Wozniak and Ronald Wayne to develop and sell Wozniak's Apple I personal computer")
```

Output:

```python
[{'entity_group': 'ORG', 'score': 0.99381506, 'word': ' Apple', 'start': 0, 'end': 5},
 {'entity_group': 'PER', 'score': 0.99970853, 'word': ' Steve Jobs', 'start': 29, 'end': 39},
 {'entity_group': 'PER', 'score': 0.99981767, 'word': ' Steve Wozniak', 'start': 41, 'end': 54},
 {'entity_group': 'PER', 'score': 0.99956465, 'word': ' Ronald Wayne', 'start': 59, 'end': 71},
 {'entity_group': 'PER', 'score': 0.9997918, 'word': ' Wozniak', 'start': 92, 'end': 99},
 {'entity_group': 'MISC', 'score': 0.99956393, 'word': ' Apple I', 'start': 102, 'end': 109}]
```
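The pipeline returns a list of dicts with `entity_group`, `word`, `start`, and `end` keys, as shown above. A minimal post-processing sketch that groups the detected spans by entity type (the sample dicts below are abbreviated from the output shown, keeping only the keys this sketch uses):

```python
# Abbreviated pipeline results: entity_group and word only.
results = [
    {"entity_group": "ORG", "word": " Apple"},
    {"entity_group": "PER", "word": " Steve Jobs"},
    {"entity_group": "MISC", "word": " Apple I"},
]

# Group the span texts by entity type, stripping the leading space
# that the roberta tokenizer keeps on each word.
by_type = {}
for ent in results:
    by_type.setdefault(ent["entity_group"], []).append(ent["word"].strip())

print(by_type)
# {'ORG': ['Apple'], 'PER': ['Steve Jobs'], 'MISC': ['Apple I']}
```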
Model performance computed on the conll2003 validation dataset (token-level predictions):

| entity | precision | recall | f1 |
|---|---|---|---|
| PER | 0.9914 | 0.9927 | 0.9920 |
| ORG | 0.9627 | 0.9661 | 0.9644 |
| LOC | 0.9795 | 0.9862 | 0.9828 |
| MISC | 0.9292 | 0.9262 | 0.9277 |
| Overall | 0.9740 | 0.9766 | 0.9753 |
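Token-level precision, recall, and F1 like the figures above come from counting per-token matches against the gold tags. A minimal sketch of that computation for one entity class (not the exact evaluation script used for this card):

```python
def token_metrics(true_tags, pred_tags, entity):
    """Token-level precision/recall/F1 for a single entity class,
    computed over aligned gold and predicted tag sequences."""
    tp = sum(1 for t, p in zip(true_tags, pred_tags) if t == p == entity)
    fp = sum(1 for t, p in zip(true_tags, pred_tags) if p == entity and t != entity)
    fn = sum(1 for t, p in zip(true_tags, pred_tags) if t == entity and p != entity)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy example: one ORG token was mislabelled as PER.
true = ["PER", "O", "ORG", "PER", "O"]
pred = ["PER", "O", "PER", "PER", "O"]
print(token_metrics(true, pred, "PER"))
# (0.6666666666666666, 1.0, 0.8)
```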
Model performance computed on a private dataset (emails, chats, informal discussions), word-level predictions:

| entity | precision | recall | f1 |
|---|---|---|---|
| PER | 0.8823 | 0.9116 | 0.8967 |
| ORG | 0.7694 | 0.7292 | 0.7487 |
| LOC | 0.8619 | 0.7768 | 0.8171 |
By comparison, the Spacy model (en_core_web_trf-3.2.0) achieves the following on the same private dataset:

| entity | precision | recall | f1 |
|---|---|---|---|
| PER | 0.9146 | 0.8287 | 0.8695 |
| ORG | 0.7655 | 0.6437 | 0.6993 |
| LOC | 0.8727 | 0.6180 | 0.7236 |