Model:
dslim/bert-large-NER
bert-large-NER is a fine-tuned BERT model that is ready to use for Named Entity Recognition and achieves state-of-the-art performance for the NER task. It has been trained to recognize four types of entities: location (LOC), organization (ORG), person (PER), and miscellaneous (MISC).
Specifically, this model is a fine-tuned bert-large-cased model trained on the English version of the standard CoNLL-2003 Named Entity Recognition dataset.
If you would like a smaller BERT model fine-tuned on the same dataset, a bert-base-NER version is also available.
You can use this model with the Transformers pipeline for NER.
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-large-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-large-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example = "My name is Wolfgang and I live in Berlin"
ner_results = nlp(example)
print(ner_results)
```
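For reference, the pipeline returns one dict per predicted token. The shape below is illustrative; the scores shown are placeholders, not actual model output:

```python
# Illustrative output shape (scores are placeholders, not real model output):
# [{'entity': 'B-PER', 'score': 0.99, 'index': 4, 'word': 'Wolfgang', 'start': 11, 'end': 19},
#  {'entity': 'B-LOC', 'score': 0.99, 'index': 9, 'word': 'Berlin', 'start': 34, 'end': 40}]
```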
Limitations and bias

The training data for this model is a set of entity-annotated news articles from a specific span of time, so it may not generalize well to all use cases in other domains. Furthermore, the model occasionally tags subword tokens as entities, and post-processing of the results may be needed to handle those cases.
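One way to handle the subword issue is the pipeline's built-in aggregation. The sketch below (not part of the original card) uses aggregation_strategy="simple", which merges subword pieces into whole-word entity spans:

```python
from transformers import pipeline

# aggregation_strategy="simple" groups subword tokens (e.g. "Wolf", "##gang")
# into a single entity span with an averaged score.
nlp = pipeline("ner", model="dslim/bert-large-NER", aggregation_strategy="simple")

print(nlp("My name is Wolfgang and I live in Berlin"))
# Each result now carries an 'entity_group' key (e.g. 'PER', 'LOC') instead of
# per-token 'entity' labels such as 'B-PER'.
```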
Training data

This model was fine-tuned on the English version of the standard CoNLL-2003 Named Entity Recognition dataset.
The training dataset distinguishes between the beginning and the continuation of an entity, so that if there are back-to-back entities of the same type, the model can output where the second entity begins. As in the dataset, each token is classified as one of the following classes (a hand-constructed tagging example follows the table):
| Abbreviation | Description |
|---|---|
| O | Outside of a named entity |
| B-MISC | Beginning of a miscellaneous entity right after another miscellaneous entity |
| I-MISC | Miscellaneous entity |
| B-PER | Beginning of a person's name right after another person's name |
| I-PER | Person's name |
| B-ORG | Beginning of an organization right after another organization |
| I-ORG | Organization |
| B-LOC | Beginning of a location right after another location |
| I-LOC | Location |
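To make the B-/I- convention in the table concrete, here is a hand-constructed example (not from the card) of two same-type entities tagged back to back, following the dataset convention described above:

```python
# Hand-constructed illustration of the tagging convention (not model output):
# two person entities appear back to back, so the second one opens with B-PER,
# making the boundary between them recoverable.
tokens = ["Alice", "Smith", "Bob",   "Jones", "visited", "Berlin"]
tags   = ["I-PER", "I-PER", "B-PER", "I-PER", "O",       "I-LOC"]
```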
The dataset was derived from the Reuters corpus, which consists of Reuters news stories. You can read more about how this dataset was created in the CoNLL-2003 paper.
The number of training examples per entity type:

| Dataset | LOC | MISC | ORG | PER |
|---|---|---|---|---|
| Train | 7140 | 3438 | 6321 | 6600 |
| Dev | 1837 | 922 | 1341 | 1842 |
| Test | 1668 | 702 | 1661 | 1617 |
The number of articles, sentences, and tokens per dataset:

| Dataset | Articles | Sentences | Tokens |
|---|---|---|---|
| Train | 946 | 14,987 | 203,621 |
| Dev | 216 | 3,466 | 51,362 |
| Test | 231 | 3,684 | 46,435 |
Training procedure

This model was trained on a single NVIDIA V100 GPU with the recommended hyperparameters from the original BERT paper, which trained and evaluated the model on the CoNLL-2003 NER task.
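The card does not include training code. The sketch below reproduces the described setup with the Hugging Face Trainer and the conll2003 config of the datasets library; the specific learning rate, batch size, and epoch count are assumptions picked from the ranges the BERT paper recommends:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

# Load CoNLL-2003 and a cased BERT checkpoint; the label set matches the
# B-/I- tags listed in the table above.
dataset = load_dataset("conll2003")
label_names = dataset["train"].features["ner_tags"].feature.names
tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-large-cased", num_labels=len(label_names))

def tokenize_and_align(examples):
    # Re-align word-level NER tags with BERT's subword tokens: only the first
    # subword of each word keeps its label, the rest are masked with -100 so
    # they are ignored by the loss.
    tokenized = tokenizer(examples["tokens"], truncation=True,
                          is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous, labels = None, []
        for word_id in word_ids:
            if word_id is None or word_id == previous:
                labels.append(-100)
            else:
                labels.append(tags[word_id])
            previous = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

tokenized = dataset.map(tokenize_and_align, batched=True,
                        remove_columns=dataset["train"].column_names)

args = TrainingArguments(
    output_dir="bert-large-ner",
    learning_rate=3e-5,               # within the BERT paper's recommended range
    per_device_train_batch_size=32,   # assumption from the recommended range
    num_train_epochs=3,               # assumption from the recommended range
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```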
Eval results

| metric | dev | test |
|---|---|---|
| f1 | 95.7 | 91.7 |
| precision | 95.3 | 91.2 |
| recall | 96.1 | 92.3 |
The test metrics are slightly lower than the official Google BERT results, which encoded document context and experimented with CRF. More details on replicating the original results are available here.
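The card does not name the scoring tool; entity-level precision, recall, and F1 for NER are conventionally computed with the seqeval package, sketched here on toy label sequences:

```python
from seqeval.metrics import f1_score, precision_score, recall_score

# Toy gold and predicted tag sequences (one inner list per sentence).
y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "O"]]

# seqeval scores at the entity level: a prediction counts as correct only if
# both the span and the type match the gold entity exactly.
print(precision_score(y_true, y_pred))  # 1.0 (the one predicted entity is correct)
print(recall_score(y_true, y_pred))     # 0.5 (one of two gold entities found)
print(f1_score(y_true, y_pred))         # ~0.667
```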
BibTeX entry and citation info:

```bibtex
@article{DBLP:journals/corr/abs-1810-04805,
  author        = {Jacob Devlin and Ming{-}Wei Chang and Kenton Lee and Kristina Toutanova},
  title         = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language Understanding},
  journal       = {CoRR},
  volume        = {abs/1810.04805},
  year          = {2018},
  url           = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint        = {1810.04805},
  timestamp     = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl        = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}

@inproceedings{tjong-kim-sang-de-meulder-2003-introduction,
  title     = "Introduction to the {C}o{NLL}-2003 Shared Task: Language-Independent Named Entity Recognition",
  author    = "Tjong Kim Sang, Erik F. and De Meulder, Fien",
  booktitle = "Proceedings of the Seventh Conference on Natural Language Learning at {HLT}-{NAACL} 2003",
  year      = "2003",
  url       = "https://www.aclweb.org/anthology/W03-0419",
  pages     = "142--147",
}
```