Model:

novakat/nerkor-cars-onpp-hubert

Language:

hu

Other:

bert

License:

gpl

Hungarian Named Entity Recognition Model with OntoNotes5 + more entity types

  • Pretrained model used: SZTAKI-HLT/hubert-base-cc
  • Fine-tuned on the NerKor+CARS-ONPP corpus
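
For orientation, here is a minimal sketch of how a model like this is typically loaded with the Hugging Face `transformers` token-classification pipeline. It assumes the checkpoint is available on the Hub under the model ID listed above and that it works with the standard pipeline. The helper name `load_tagger` and the choice of `aggregation_strategy="simple"` are illustrative and do not come from the model card.

```python
MODEL_ID = "novakat/nerkor-cars-onpp-hubert"  # model ID from this card


def load_tagger(model_id: str = MODEL_ID):
    """Build a token-classification pipeline (hypothetical helper).

    The import is deferred so that defining this function does not require
    transformers to be installed. Calling it downloads the model weights.
    """
    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces into entity spans.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


# Usage (requires network access and the transformers package):
#   tagger = load_tagger()
#   for entity in tagger("Novák Attila Budapesten él."):
#       print(entity["entity_group"], entity["word"], entity["score"])
```

With aggregation enabled, each result is a dictionary whose `entity_group` field holds the predicted tag for the whole span.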

Limitations

  • Maximum sequence length = 448
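
One way to handle inputs longer than this limit is to split them into overlapping windows before tagging. The following is a minimal sketch, assuming the limit counts subword tokens and that two positions are reserved for the special tokens that BERT-style models add ([CLS] and [SEP]). The function name `chunk_tokens`, the reserve, and the overlap size are illustrative choices, not part of the model card.

```python
from typing import List, Sequence

MAX_SEQ_LENGTH = 448  # limit stated in the model card
SPECIAL_TOKENS = 2    # assumption: [CLS] and [SEP] count toward the limit


def chunk_tokens(tokens: Sequence[str],
                 max_len: int = MAX_SEQ_LENGTH - SPECIAL_TOKENS,
                 overlap: int = 32) -> List[List[str]]:
    """Split tokens into windows of at most max_len items.

    Consecutive windows share `overlap` tokens, so an entity cut at a window
    boundary appears whole in a neighbouring window as long as it is no
    longer than the overlap.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    if len(tokens) == 0:
        return []
    step = max_len - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # this window reached the end of the input
        start += step
    return chunks
```

Predictions from the overlapping regions still have to be reconciled afterwards, for example by keeping the prediction from the window in which the token is farther from the edge.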

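Training data

The underlying corpus, NerKor+CARS-OntoNotes++, was derived from NYTK-NerKor, a Hungarian gold-standard named entity annotated corpus containing about 1 million tokens. It also includes about 12k tokens of text about motor vehicles (cars, buses, motorcycles) from the news archive of hvg.hu. While the annotation in NYTK-NerKor followed the CoNLL2002 labelling standard, with just four named entity categories (PER, LOC, MISC, ORG), this version of the corpus features over 30 entity types, including all entity types used in the OntoNotes 5.0 English named entity annotation. The new annotation elaborates on subtypes of the LOC and MISC entity types, and includes annotation for non-names such as times and dates, quantities, languages, and nationalities or religious or political groups. The annotation was also extended with further entity subtypes not present in the OntoNotes 5 annotation (see below).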

Tags derived from the OntoNotes 5.0 annotation

Names are annotated according to the following set of types:

PER = PERSON People, including fictional
FAC = FACILITY Buildings, airports, highways, bridges, etc.
ORG = ORGANIZATION Companies, agencies, institutions, etc.
GPE Geopolitical entities: countries, cities, states
LOC = LOCATION Non-GPE locations, mountain ranges, bodies of water
PROD = PRODUCT Vehicles, weapons, foods, etc. (Not services)
EVENT Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART Titles of books, songs, etc.
LAW Named documents made into laws

The following are also annotated in a style similar to names:

NORP Nationalities or religious or political groups
LANGUAGE Any named language
DATE Absolute or relative dates or periods
TIME Times smaller than a day
PERCENT Percentage (including "%")
MONEY Monetary values, including unit
QUANTITY Measurements, as of weight or distance
ORDINAL "first", "second"
CARDINAL Numerals that do not fall under another type

Additional tags (not in OntoNotes 5)

Further subtypes of names of type MISC:

AWARD Awards and prizes
CAR Cars and other motor vehicles
MEDIA Media outlets, TV channels, news portals
SMEDIA Social media platforms
PROJ Projects and initiatives
MISC Unresolved subtypes of MISC entities
MISC-ORG Organization-like unresolved subtypes of MISC entities

Further non-name entities:

DUR Time duration
AGE Age
ID Identifier
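
The tag lists above can be captured in a small lookup, which is handy when filtering predictions or mapping the fine-grained MISC subtypes back to a single MISC class. This is a sketch only: the group names and the helpers `bare_tag`, `tag_group`, and `collapse_misc` are hypothetical, and the handling of `B-`/`I-` prefixes assumes the model emits IOB-style labels, which the card does not state.

```python
# Tag inventory copied from the lists in this model card.
ONTONOTES_NAME_TAGS = {"PER", "FAC", "ORG", "GPE", "LOC", "PROD",
                       "EVENT", "WORK_OF_ART", "LAW"}
ONTONOTES_NON_NAME_TAGS = {"NORP", "LANGUAGE", "DATE", "TIME", "PERCENT",
                           "MONEY", "QUANTITY", "ORDINAL", "CARDINAL"}
MISC_SUBTYPE_TAGS = {"AWARD", "CAR", "MEDIA", "SMEDIA", "PROJ",
                     "MISC", "MISC-ORG"}
EXTRA_NON_NAME_TAGS = {"DUR", "AGE", "ID"}


def bare_tag(tag: str) -> str:
    """Drop a leading B-/I- prefix (assumption: IOB-style labels)."""
    # Check the prefix explicitly, because MISC-ORG itself contains a hyphen.
    return tag[2:] if tag[:2] in ("B-", "I-") else tag


def tag_group(tag: str) -> str:
    """Return which list of the model card a tag belongs to."""
    name = bare_tag(tag)
    if name in ONTONOTES_NAME_TAGS:
        return "ontonotes_name"
    if name in ONTONOTES_NON_NAME_TAGS:
        return "ontonotes_non_name"
    if name in MISC_SUBTYPE_TAGS:
        return "misc_subtype"
    if name in EXTRA_NON_NAME_TAGS:
        return "extra_non_name"
    return "unknown"


def collapse_misc(tag: str) -> str:
    """Map every MISC subtype to plain MISC and leave other tags unchanged."""
    name = bare_tag(tag)
    return "MISC" if name in MISC_SUBTYPE_TAGS else name
```

For example, `collapse_misc("CAR")` yields `"MISC"`, while `collapse_misc("GPE")` leaves the tag as `"GPE"`.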

If you use this model, please cite:

@inproceedings{novak-novak-2022-nerkor,
    title = "{N}er{K}or+{C}ars-{O}nto{N}otes++",
    author = "Nov{\'a}k, Attila  and
      Nov{\'a}k, Borb{\'a}la",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.203",
    pages = "1907--1916",
    abstract = "In this paper, we present an upgraded version of the Hungarian NYTK-NerKor named entity corpus, which contains about twice as many annotated spans and 7 times as many distinct entity types as the original version. We used an extended version of the OntoNotes 5 annotation scheme including time and numerical expressions. NerKor is the newest and biggest NER corpus for Hungarian containing diverse domains. We applied cross-lingual transfer of NER models trained for other languages based on multilingual contextual language models to preannotate the corpus. We corrected the annotation semi-automatically and manually. Zero-shot preannotation was very effective with about 0.82 F1 score for the best model. We also added a 12000-token subcorpus on cars and other motor vehicles. We trained and release a transformer-based NER tagger for Hungarian using the annotation in the new corpus version, which provides similar performance to an identical model trained on the original version of the corpus.",
}