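The underlying corpus, NerKor+CARS-OntoNotes++, is derived from NYTK-NerKor, a Hungarian gold-standard named-entity-annotated corpus of about 1 million tokens. It also includes about 12k tokens of text on motor vehicles (cars, buses, motorcycles) from the hvg.hu news archive. The annotation in NYTK-NerKor followed the CoNLL2002 labelling standard, with only four named-entity categories (PER, LOC, MISC, ORG). This version of the corpus features more than 30 entity types, including all entity types used in the OntoNotes 5.0 English named-entity annotation. The new annotation elaborates subtypes of the LOC and MISC entity types and also covers non-name expressions such as times and dates, quantities, languages, and nationalities or religious or political groups. In addition, it introduces further entity subtypes that are not present in the OntoNotes 5 annotation (see below).
Names are annotated according to the following set of types: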
PER | = PERSON People, including fictional |
FAC | = FACILITY Buildings, airports, highways, bridges, etc. |
ORG | = ORGANIZATION Companies, agencies, institutions, etc. |
GPE | Geopolitical entities: countries, cities, states |
LOC | = LOCATION Non-GPE locations, mountain ranges, bodies of water |
PROD | = PRODUCT Vehicles, weapons, foods, etc. (Not services) |
EVENT | Named hurricanes, battles, wars, sports events, etc. |
WORK_OF_ART | Titles of books, songs, etc. |
LAW | Named documents made into laws |
The following are also annotated in a name-like fashion:
NORP | Nationalities or religious or political groups |
LANGUAGE | Any named language |
DATE | Absolute or relative dates or periods |
TIME | Times smaller than a day |
PERCENT | Percentage (including "%") |
MONEY | Monetary values, including unit |
QUANTITY | Measurements, as of weight or distance |
ORDINAL | "first", "second" |
CARDINAL | Numerals that do not fall under another type |
Names of further subtypes of the MISC type:
AWARD | Awards and prizes |
CAR | Cars and other motor vehicles |
MEDIA | Media outlets, TV channels, news portals |
SMEDIA | Social media platforms |
PROJ | Projects and initiatives |
MISC | Unresolved subtypes of MISC entities |
MISC-ORG | Organization-like unresolved subtypes of MISC entities |
Further non-name entities:
DUR | Time duration |
AGE | Age |
ID | Identifier |
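The tag inventory above can be captured in a small lookup table, which is handy when filtering or aggregating labels. The following is a minimal sketch, not part of the corpus release: the group names (`name`, `name_like`, `misc_subtype`, `non_name`) and the helper `tag_group` are hypothetical, and the sketch assumes labels may carry a `B-`/`I-` prefix with `O` for untagged tokens, which is not stated in this description.

```python
# Tag inventory copied from the lists above.
# The group names are hypothetical labels chosen for this sketch.
TAG_GROUPS = {
    "name": ["PER", "FAC", "ORG", "GPE", "LOC", "PROD",
             "EVENT", "WORK_OF_ART", "LAW"],
    "name_like": ["NORP", "LANGUAGE", "DATE", "TIME", "PERCENT",
                  "MONEY", "QUANTITY", "ORDINAL", "CARDINAL"],
    "misc_subtype": ["AWARD", "CAR", "MEDIA", "SMEDIA", "PROJ",
                     "MISC", "MISC-ORG"],
    "non_name": ["DUR", "AGE", "ID"],
}

# Reverse index: tag -> group name.
TAG_TO_GROUP = {
    tag: group for group, tags in TAG_GROUPS.items() for tag in tags
}


def tag_group(label):
    """Return the group of a label, or None for the outside label "O".

    Assumption: labels may carry a "B-" or "I-" prefix. Only that prefix
    is stripped, so hyphenated tags such as MISC-ORG stay intact.
    Raises KeyError for tags that are not in the inventory.
    """
    if label == "O":
        return None
    if label[:2] in ("B-", "I-"):
        label = label[2:]
    return TAG_TO_GROUP[label]
```

For example, `tag_group("I-MISC-ORG")` yields `"misc_subtype"` and `tag_group("DUR")` yields `"non_name"`. If the actual label encoding differs, only the prefix handling needs to change.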
@inproceedings{novak-novak-2022-nerkor,
    title = "{N}er{K}or+{C}ars-{O}nto{N}otes++",
    author = "Nov{\'a}k, Attila and Nov{\'a}k, Borb{\'a}la",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.203",
    pages = "1907--1916",
    abstract = "In this paper, we present an upgraded version of the Hungarian NYTK-NerKor named entity corpus, which contains about twice as many annotated spans and 7 times as many distinct entity types as the original version. We used an extended version of the OntoNotes 5 annotation scheme including time and numerical expressions. NerKor is the newest and biggest NER corpus for Hungarian containing diverse domains. We applied cross-lingual transfer of NER models trained for other languages based on multilingual contextual language models to preannotate the corpus. We corrected the annotation semi-automatically and manually. Zero-shot preannotation was very effective with about 0.82 F1 score for the best model. We also added a 12000-token subcorpus on cars and other motor vehicles. We trained and release a transformer-based NER tagger for Hungarian using the annotation in the new corpus version, which provides similar performance to an identical model trained on the original version of the corpus.",
}