数据集:

clarin-pl/kpwr-ner

英文

KPWR-NER

描述

KPWR-NER 是Wroclaw理工大学波兰语语料库(Korpus Języka Polskiego Politechniki Wrocławskiej)的一部分。它的目标是对实体的细粒度类别进行命名实体识别。它是KPWr的“n82”版本,意味着类别数量限制为82个(原本为120个)。在创建语料库过程中,使用人工注释了来自各种来源、涵盖多个领域和体裁的文本。

任务(输入,输出和指标)

命名实体识别(NER) - 对文本中的实体进行标记,并指定其对应的类型。

输入('tokens'列):一系列标记

输出('ner'列):使用BIO符号标记法标记的预测标记序列(82种可能的类别,详细描述在注释指南中)

度量标准:F1分数(seqeval)

示例:

输入:[‘Roboty’, ‘mają’, ‘kilkanaście’, ‘lat’, ‘i’, ‘pochodzą’, ‘z’, ‘USA’, ‘,’, ‘Wysokie’, ‘napięcie’, ‘jest’, ‘dużo’, ‘młodsze’, ‘,’, ‘powstało’, ‘w’, ‘Niemczech’, ‘.’]

输入(由DeepL翻译):机器人已经存在十几年了,来自美国,高压来自德国,年轻多了。

输出:[‘B-nam_pro_title’, ‘O’, ‘O’, ‘O’, ‘O’, ‘O’, ‘O’, ‘B-nam_loc_gpe_country’, ‘O’, ‘B-nam_pro_title’, ‘I-nam_pro_title’, ‘O’, ‘O’, ‘O’, ‘O’, ‘O’, ‘O’, ‘B-nam_loc_gpe_country’, ‘O’]

数据划分

Subset Cardinality (sentences)
train 13959
dev 0
test 4323

类别分布(不包括“O”和“I-*”)

Class train validation test
B-nam_liv_person 0.21910 - 0.21422
B-nam_loc_gpe_city 0.10101 - 0.09865
B-nam_loc_gpe_country 0.07467 - 0.08059
B-nam_org_institution 0.05893 - 0.06005
B-nam_org_organization 0.04448 - 0.05553
B-nam_org_group_team 0.03492 - 0.03363
B-nam_adj_country 0.03410 - 0.03747
B-nam_org_company 0.02439 - 0.01716
B-nam_pro_media_periodic 0.02250 - 0.01896
B-nam_fac_road 0.01995 - 0.02144
B-nam_liv_god 0.01934 - 0.00790
B-nam_org_nation 0.01739 - 0.01828
B-nam_oth_tech 0.01724 - 0.01377
B-nam_pro_media_web 0.01709 - 0.00903
B-nam_fac_goe 0.01596 - 0.01445
B-nam_eve_human 0.01573 - 0.01761
B-nam_pro_title 0.01558 - 0.00790
B-nam_pro_brand 0.01543 - 0.01038
B-nam_org_political_party 0.01264 - 0.01309
B-nam_loc_gpe_admin1 0.01219 - 0.01445
B-nam_eve_human_sport 0.01174 - 0.01242
B-nam_pro_software 0.01091 - 0.02190
B-nam_adj 0.00963 - 0.01174
B-nam_loc_gpe_admin3 0.00888 - 0.01061
B-nam_pro_model_car 0.00873 - 0.00587
B-nam_loc_hydronym_river 0.00843 - 0.01151
B-nam_oth 0.00775 - 0.00497
B-nam_pro_title_document 0.00738 - 0.01986
B-nam_loc_astronomical 0.00730 - -
B-nam_oth_currency 0.00723 - 0.01151
B-nam_adj_city 0.00670 - 0.00948
B-nam_org_group_band 0.00587 - 0.00429
B-nam_loc_gpe_admin2 0.00565 - 0.00813
B-nam_loc_gpe_district 0.00504 - 0.00406
B-nam_loc_land_continent 0.00459 - 0.00722
B-nam_loc_country_region 0.00459 - 0.00090
B-nam_loc_land_mountain 0.00414 - 0.00203
B-nam_pro_title_book 0.00384 - 0.00248
B-nam_loc_historical_region 0.00376 - 0.00497
B-nam_loc 0.00361 - 0.00090
B-nam_eve 0.00361 - 0.00181
B-nam_org_group 0.00331 - 0.00406
B-nam_loc_land_island 0.00331 - 0.00248
B-nam_pro_media_tv 0.00316 - 0.00158
B-nam_liv_habitant 0.00316 - 0.00158
B-nam_eve_human_cultural 0.00316 - 0.00497
B-nam_pro_title_tv 0.00309 - 0.00542
B-nam_oth_license 0.00286 - 0.00248
B-nam_num_house 0.00256 - 0.00248
B-nam_pro_title_treaty 0.00248 - 0.00045
B-nam_fac_system 0.00248 - 0.00587
B-nam_loc_gpe_subdivision 0.00241 - 0.00587
B-nam_loc_land_region 0.00226 - 0.00248
B-nam_pro_title_album 0.00218 - 0.00158
B-nam_adj_person 0.00203 - 0.00406
B-nam_fac_square 0.00196 - 0.00135
B-nam_pro_award 0.00188 - 0.00519
B-nam_eve_human_holiday 0.00188 - 0.00203
B-nam_pro_title_song 0.00166 - 0.00158
B-nam_pro_media_radio 0.00151 - 0.00068
B-nam_pro_vehicle 0.00151 - 0.00090
B-nam_oth_position 0.00143 - 0.00226
B-nam_liv_animal 0.00143 - 0.00248
B-nam_pro 0.00135 - 0.00045
B-nam_oth_www 0.00120 - 0.00451
B-nam_num_phone 0.00120 - 0.00045
B-nam_pro_title_article 0.00113 - -
B-nam_oth_data_format 0.00113 - 0.00226
B-nam_fac_bridge 0.00105 - 0.00090
B-nam_liv_character 0.00098 - -
B-nam_pro_software_game 0.00090 - 0.00068
B-nam_loc_hydronym_lake 0.00090 - 0.00045
B-nam_loc_gpe_conurbation 0.00090 - -
B-nam_pro_media 0.00083 - 0.00181
B-nam_loc_land 0.00075 - 0.00045
B-nam_loc_land_peak 0.00075 - -
B-nam_fac_park 0.00068 - 0.00226
B-nam_org_organization_sub 0.00060 - 0.00068
B-nam_loc_hydronym 0.00060 - 0.00023
B-nam_loc_hydronym_sea 0.00045 - 0.00068
B-nam_loc_hydronym_ocean 0.00045 - 0.00023
B-nam_fac_goe_stop 0.00038 - 0.00090

引用

@inproceedings{broda-etal-2012-kpwr,
    title = "{KPW}r: Towards a Free Corpus of {P}olish",
    author = "Broda, Bartosz  and
      Marci{\'n}czuk, Micha{\l}  and
      Maziarz, Marek  and
      Radziszewski, Adam  and
      Wardy{\'n}ski, Adam",
    booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
    month = may,
    year = "2012",
    address = "Istanbul, Turkey",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/965_Paper.pdf",
    pages = "3218--3222",
    abstract = "This paper presents our efforts aimed at collecting and annotating a free Polish corpus. The corpus will serve for us as training and testing material for experiments with Machine Learning algorithms. As others may also benefit from the resource, we are going to release it under a Creative Commons licence, which is hoped to remove unnecessary usage restrictions, but also to facilitate reproduction of our experimental results. The corpus is being annotated with various types of linguistic entities: chunks and named entities, selected syntactic and semantic relations, word senses and anaphora. We report on the current state of the project as well as our ultimate goals.",
}

许可

Creative Commons Attribution 3.0 Unported Licence

链接

HuggingFace

Source

Paper

KPWr annotation guidelines

KPWr annotation guidelines - named entities

示例

加载

from pprint import pprint

from datasets import load_dataset

dataset = load_dataset("clarin-pl/kpwr-ner")
pprint(dataset['train'][0])

# {'lemmas': ['roborally', 'czy', 'wysoki', 'napięcie', '?'],
#  'ner': [73, 160, 73, 151, 160],
#  'orth': ['subst:sg:nom:n',
#           'qub',
#           'adj:sg:nom:n:pos',
#           'subst:sg:nom:n',
#           'interp'],
#  'tokens': ['RoboRally', 'czy', 'Wysokie', 'napięcie', '?']}

评估

import random
from pprint import pprint

from datasets import load_dataset, load_metric

dataset = load_dataset("clarin-pl/kpwr-ner")
references = dataset["test"]["ner"]

# generate random predictions
predictions = [
    [
        random.randrange(dataset["train"].features["ner"].feature.num_classes)
        for _ in range(len(labels))
    ]
    for labels in references
]

# transform to original names of labels
references_named = [
    [dataset["train"].features["ner"].feature.names[label] for label in labels]
    for labels in references
]
predictions_named = [
    [dataset["train"].features["ner"].feature.names[label] for label in labels]
    for labels in predictions
]

# utilise seqeval to evaluate
seqeval = load_metric("seqeval")
seqeval_score = seqeval.compute(
    predictions=predictions_named, references=references_named, scheme="IOB2"
)

pprint(seqeval_score, depth=1)

# {'nam_adj': {...},
#  'nam_adj_city': {...},
#  'nam_adj_country': {...},
#  'nam_adj_person': {...},
#  'nam_eve': {...},
#  'nam_eve_human': {...},
#  'nam_eve_human_cultural': {...},
#  'nam_eve_human_holiday': {...},
#  'nam_eve_human_sport': {...},
#  'nam_fac_bridge': {...},
#  'nam_fac_goe': {...},
#  'nam_fac_goe_stop': {...},
#  'nam_fac_park': {...},
#  'nam_fac_road': {...},
#  'nam_fac_square': {...},
#  'nam_fac_system': {...},
#  'nam_liv_animal': {...},
#  'nam_liv_character': {...},
#  'nam_liv_god': {...},
#  'nam_liv_habitant': {...},
#  'nam_liv_person': {...},
#  'nam_loc': {...},
#  'nam_loc_astronomical': {...},
#  'nam_loc_country_region': {...},
#  'nam_loc_gpe_admin1': {...},
#  'nam_loc_gpe_admin2': {...},
#  'nam_loc_gpe_admin3': {...},
#  'nam_loc_gpe_city': {...},
#  'nam_loc_gpe_conurbation': {...},
#  'nam_loc_gpe_country': {...},
#  'nam_loc_gpe_district': {...},
#  'nam_loc_gpe_subdivision': {...},
#  'nam_loc_historical_region': {...},
#  'nam_loc_hydronym': {...},
#  'nam_loc_hydronym_lake': {...},
#  'nam_loc_hydronym_ocean': {...},
#  'nam_loc_hydronym_river': {...},
#  'nam_loc_hydronym_sea': {...},
#  'nam_loc_land': {...},
#  'nam_loc_land_continent': {...},
#  'nam_loc_land_island': {...},
#  'nam_loc_land_mountain': {...},
#  'nam_loc_land_peak': {...},
#  'nam_loc_land_region': {...},
#  'nam_num_house': {...},
#  'nam_num_phone': {...},
#  'nam_org_company': {...},
#  'nam_org_group': {...},
#  'nam_org_group_band': {...},
#  'nam_org_group_team': {...},
#  'nam_org_institution': {...},
#  'nam_org_nation': {...},
#  'nam_org_organization': {...},
#  'nam_org_organization_sub': {...},
#  'nam_org_political_party': {...},
#  'nam_oth': {...},
#  'nam_oth_currency': {...},
#  'nam_oth_data_format': {...},
#  'nam_oth_license': {...},
#  'nam_oth_position': {...},
#  'nam_oth_tech': {...},
#  'nam_oth_www': {...},
#  'nam_pro': {...},
#  'nam_pro_award': {...},
#  'nam_pro_brand': {...},
#  'nam_pro_media': {...},
#  'nam_pro_media_periodic': {...},
#  'nam_pro_media_radio': {...},
#  'nam_pro_media_tv': {...},
#  'nam_pro_media_web': {...},
#  'nam_pro_model_car': {...},
#  'nam_pro_software': {...},
#  'nam_pro_software_game': {...},
#  'nam_pro_title': {...},
#  'nam_pro_title_album': {...},
#  'nam_pro_title_article': {...},
#  'nam_pro_title_book': {...},
#  'nam_pro_title_document': {...},
#  'nam_pro_title_song': {...},
#  'nam_pro_title_treaty': {...},
#  'nam_pro_title_tv': {...},
#  'nam_pro_vehicle': {...},
#  'overall_accuracy': 0.006156203762418094,
#  'overall_f1': 0.0009844258777797407,
#  'overall_precision': 0.0005213624939842789,
#  'overall_recall': 0.008803611738148984}