Datasets:

clarin-pl/nkjp-pos

Sub-tasks:

part-of-speech

Languages:

pl

Multilinguality:

monolingual

Language Creators:

other

Annotations Creators:

expert-generated

Source Datasets:

original

License:

gpl-3.0

nkjp-pos

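Description

NKJP-POS is a part of the National Corpus of Polish (Narodowy Korpus Języka Polskiego). Its objective is part-of-speech tagging, e.g. nouns, verbs, adjectives, adverbs, etc. During the creation of the corpus, texts from various sources, covering many domains and genres, were annotated by humans.

Tasks (input, output and metrics)

Part-of-speech tagging (POS tagging) - tagging each word in a text with its corresponding part of speech.

Input ('tokens' column): sequence of tokens

Output ('pos_tags' column): sequence of predicted token classes (35 possible classes, described in detail in the annotation guidelines)

Metric: F1 score (seqeval)

Example:

Input: ['Zarejestruj', 'się', 'jako', 'bezrobotny', '.']

Input (translated by DeepL): Register as unemployed.

Output: ['impt', 'qub', 'conj', 'subst', 'interp']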
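The example above can be read as a one-to-one alignment between tokens and tags. A minimal sketch of that alignment follows; the helper name `pair_tokens_with_tags` is hypothetical and not part of the dataset or any library.

```python
# The example sentence and its tags, copied from the card.
tokens = ['Zarejestruj', 'się', 'jako', 'bezrobotny', '.']
pos_tags = ['impt', 'qub', 'conj', 'subst', 'interp']


def pair_tokens_with_tags(tokens, tags):
    """Return (token, tag) pairs; hypothetical helper for illustration."""
    if len(tokens) != len(tags):
        # POS tagging assigns exactly one tag per token.
        raise ValueError("tokens and tags must have the same length")
    return list(zip(tokens, tags))


pairs = pair_tokens_with_tags(tokens, pos_tags)
for token, tag in pairs:
    print(f"{token}\t{tag}")
```

A length check of this kind is a cheap guard when model output is mapped back onto the input tokens.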

Data splits

Subset Cardinality (sentences)
train 78219
dev 0
test 7444
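Because the dev split is empty, model selection needs a held-out portion of train. The sketch below shows one way to do that on sentence indices; the function name `hold_out_dev`, the 10% fraction, and the seed are assumptions made for illustration, not choices documented by the dataset authors.

```python
import random


def hold_out_dev(items, dev_fraction=0.1, seed=0):
    """Shuffle items with a fixed seed and split off a dev portion.

    Hypothetical helper: fraction and seed are illustrative defaults.
    """
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be strictly between 0 and 1")
    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


# 78219 is the number of train sentences listed in the table above.
train_ids, dev_ids = hold_out_dev(range(78219))
print(len(train_ids), len(dev_ids))
```

When working with the `datasets` library directly, `Dataset.train_test_split` offers a comparable seeded split.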

Class distribution

Class train dev test
subst 0.27345 - 0.27656
interp 0.18101 - 0.17944
adj 0.10611 - 0.10919
prep 0.09567 - 0.09547
qub 0.05670 - 0.05491
fin 0.04939 - 0.04648
praet 0.04409 - 0.04348
conj 0.03711 - 0.03724
adv 0.03512 - 0.03333
inf 0.01591 - 0.01547
comp 0.01476 - 0.01439
num 0.01322 - 0.01436
ppron3 0.01111 - 0.01018
ppas 0.01086 - 0.01085
ger 0.00961 - 0.01050
brev 0.00856 - 0.01181
ppron12 0.00670 - 0.00665
aglt 0.00629 - 0.00602
pred 0.00539 - 0.00540
pact 0.00454 - 0.00452
bedzie 0.00229 - 0.00243
pcon 0.00218 - 0.00189
impt 0.00203 - 0.00226
siebie 0.00177 - 0.00158
imps 0.00174 - 0.00177
interj 0.00131 - 0.00102
xxx 0.00070 - 0.00048
adjp 0.00069 - 0.00065
winien 0.00068 - 0.00057
adja 0.00048 - 0.00058
pant 0.00012 - 0.00018
burk 0.00011 - 0.00006
numcol 0.00011 - 0.00013
depr 0.00010 - 0.00004
adjc 0.00007 - 0.00008
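Relative frequencies like those in the table can be computed by counting tags over all sentences of a split. The following is a minimal sketch on toy data: the first sequence is the example from this card, the second is made up for illustration, and `class_distribution` is a hypothetical helper.

```python
from collections import Counter


def class_distribution(tag_sequences):
    """Return the relative frequency of each tag, most frequent first."""
    counts = Counter(tag for sequence in tag_sequences for tag in sequence)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.most_common()}


toy_sequences = [
    ['impt', 'qub', 'conj', 'subst', 'interp'],  # example from this card
    ['subst', 'interp'],                         # made-up toy sequence
]
distribution = class_distribution(toy_sequences)
for tag, share in distribution.items():
    print(f"{tag} {share:.5f}")
```

The table also gives a simple reference point: a tagger that always predicts subst would be correct on about 0.277 of test tokens, since that is the share of subst in the test split.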

Citation

@book{przepiorkowski_narodowy_2012,
title = {Narodowy korpus języka polskiego},
isbn = {978-83-01-16700-4},
language = {pl},
publisher = {Wydawnictwo Naukowe PWN},
editor = {Przepiórkowski, Adam and Bańko, Mirosław and Górski, Rafał L. and Lewandowska-Tomaszczyk, Barbara},
year = {2012}
}

License

GNU GPL v.3

Links

HuggingFace

Source

Paper

Examples

Loading

from pprint import pprint

from datasets import load_dataset

dataset = load_dataset("clarin-pl/nkjp-pos")
pprint(dataset['train'][5000])

# {'id': '130-2-900005_morph_49.49-s',
#  'pos_tags': [16, 4, 3, 30, 12, 18, 3, 16, 14, 6, 14, 26, 1, 30, 12],
#  'tokens': ['Najwyraźniej',
#             'źle',
#             'ocenił',
#             'odległość',
#             ',',
#             'bo',
#             'zderzył',
#             'się',
#             'z',
#             'jadącą',
#             'z',
#             'naprzeciwka',
#             'ciężarową',
#             'scanią',
#             '.']}

Evaluation

import random
from pprint import pprint

import evaluate
from datasets import load_dataset

dataset = load_dataset("clarin-pl/nkjp-pos")
references = dataset["test"]["pos_tags"]

# ClassLabel feature that maps integer ids to tag names
label_feature = dataset["train"].features["pos_tags"].feature

# generate random predictions
predictions = [
    [random.randrange(label_feature.num_classes) for _ in range(len(labels))]
    for labels in references
]

# transform to original names of labels
references_named = [
    [label_feature.names[label] for label in labels]
    for labels in references
]
predictions_named = [
    [label_feature.names[label] for label in labels]
    for labels in predictions
]

# transform to BILOU scheme: the "U-" prefix marks every token
# as a single-token entity
references_named = [
    [f"U-{label}" if label != "O" else label for label in labels]
    for labels in references_named
]
predictions_named = [
    [f"U-{label}" if label != "O" else label for label in labels]
    for labels in predictions_named
]

# use seqeval to evaluate; `load_metric` has been removed from recent
# `datasets` releases, so the metric is loaded via the `evaluate` package
seqeval = evaluate.load("seqeval")
seqeval_score = seqeval.compute(
    predictions=predictions_named,
    references=references_named,
    scheme="BILOU",
    mode="strict",
)

pprint(seqeval_score, depth=1)

# {'adj': {...},
#  'adja': {...},
#  'adjc': {...},
#  'adjp': {...},
#  'adv': {...},
#  'aglt': {...},
#  'bedzie': {...},
#  'brev': {...},
#  'burk': {...},
#  'comp': {...},
#  'conj': {...},
#  'depr': {...},
#  'fin': {...},
#  'ger': {...},
#  'imps': {...},
#  'impt': {...},
#  'inf': {...},
#  'interj': {...},
#  'interp': {...},
#  'num': {...},
#  'numcol': {...},
#  'overall_accuracy': 0.027855061488566583,
#  'overall_f1': 0.027855061488566583,
#  'overall_precision': 0.027855061488566583,
#  'overall_recall': 0.027855061488566583,
#  'pact': {...},
#  'pant': {...},
#  'pcon': {...},
#  'ppas': {...},
#  'ppron12': {...},
#  'ppron3': {...},
#  'praet': {...},
#  'pred': {...},
#  'prep': {...},
#  'qub': {...},
#  'siebie': {...},
#  'subst': {...},
#  'winien': {...},
#  'xxx': {...}}
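In the output above, overall accuracy, precision, recall and F1 are identical. This is expected when every token is wrapped as a single-token `U-` entity: each token contributes exactly one gold and one predicted entity, so the micro-averaged scores reduce to token accuracy. The sketch below reproduces that reduction in plain Python; `token_micro_f1` is a hypothetical helper, not part of seqeval, and the prediction list is made up for illustration.

```python
def token_micro_f1(references, predictions):
    """Micro-averaged F1 when every token carries exactly one label.

    Hypothetical helper: with one gold and one predicted label per token,
    precision and recall both equal token accuracy.
    """
    correct = 0
    total = 0
    for ref, pred in zip(references, predictions):
        if len(ref) != len(pred):
            raise ValueError("each prediction must match its reference length")
        total += len(ref)
        correct += sum(r == p for r, p in zip(ref, pred))
    if total == 0 or correct == 0:
        return 0.0
    precision = recall = correct / total
    return 2 * precision * recall / (precision + recall)


gold = [['impt', 'qub', 'conj', 'subst', 'interp']]       # from this card
predicted = [['impt', 'qub', 'comp', 'subst', 'interp']]  # made-up prediction
score = token_micro_f1(gold, predicted)
print(round(score, 3))
```

Uniform random guessing over 35 classes is expected to be correct on about 1/35 ≈ 0.029 of tokens, which is consistent with the score of roughly 0.028 shown above.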