Dataset:

clarin-pl/nkjp-pos

Tasks:

task_categories:other

Sub-tasks:

part-of-speech

Languages:

Multilinguality:

monolingual

Size:

size_categories:unknown

Language creators:

other

Annotations creators:

expert-generated

Source datasets:

original

Other:

structure-prediction

License:

gpl-3.0

nkjp-pos

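Description

NKJP-POS is part of the National Corpus of Polish (Narodowy Korpus Języka Polskiego). It targets part-of-speech tagging, i.e. labelling words as nouns, verbs, adjectives, adverbs, and so on. During corpus creation, texts drawn from various sources and covering many domains and genres were annotated by humans.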

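Tasks (input, output and metrics)

Part-of-speech tagging (POS tagging) - tagging each word in a text with its corresponding part of speech.

Input ('tokens' column): sequence of tokens

Output ('pos_tags' column): sequence of predicted token classes (35 possible classes, described in detail in the annotation guidelines)

Metric: F1 score (seqeval)

Example: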

Input: ['Zarejestruj', 'się', 'jako', 'bezrobotny', '.']

Input (translated by DeepL): Register as unemployed.

Output: ['impt', 'qub', 'conj', 'subst', 'interp']
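
As a minimal sketch of the input/output contract, the example can be held as two parallel lists with exactly one tag per token. The helper name `pair_tokens_with_tags` is hypothetical and not part of the dataset or of any library.

```python
def pair_tokens_with_tags(tokens, pos_tags):
    """Return (token, tag) pairs; POS tagging assigns exactly one tag per token."""
    if len(tokens) != len(pos_tags):
        raise ValueError("tokens and pos_tags must have the same length")
    return list(zip(tokens, pos_tags))


tokens = ["Zarejestruj", "się", "jako", "bezrobotny", "."]
pos_tags = ["impt", "qub", "conj", "subst", "interp"]

# Print one token and its tag per line, separated by a tab.
for token, tag in pair_tokens_with_tags(tokens, pos_tags):
    print(f"{token}\t{tag}")
```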

Data splits

Subset	Cardinality (sentences)
train	78219
dev	0
test	7444
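
Because the dev subset is empty, anyone who needs validation data has to carve it out of train. The sketch below shows one possible way to do that reproducibly with the standard library only. The function name `split_indices`, the 10% fraction, and the seed are all illustrative assumptions, not a split recommended by the dataset authors.

```python
import random


def split_indices(n_examples, dev_fraction=0.1, seed=42):
    """Shuffle example indices reproducibly and split them into train/dev lists."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # local RNG, so global random state is untouched
    n_dev = int(n_examples * dev_fraction)
    return sorted(indices[n_dev:]), sorted(indices[:n_dev])


# 78219 is the number of train sentences listed in the table.
train_idx, dev_idx = split_indices(78219)
print(len(train_idx), len(dev_idx))
```

With the `datasets` library, index lists like these could be passed to `Dataset.select`; `Dataset.train_test_split` is an alternative that does the shuffling itself.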

Class distribution

Class	train	dev	test
subst	0.27345	-	0.27656
interp	0.18101	-	0.17944
adj	0.10611	-	0.10919
prep	0.09567	-	0.09547
qub	0.05670	-	0.05491
fin	0.04939	-	0.04648
praet	0.04409	-	0.04348
conj	0.03711	-	0.03724
adv	0.03512	-	0.03333
inf	0.01591	-	0.01547
comp	0.01476	-	0.01439
num	0.01322	-	0.01436
ppron3	0.01111	-	0.01018
ppas	0.01086	-	0.01085
ger	0.00961	-	0.01050
brev	0.00856	-	0.01181
ppron12	0.00670	-	0.00665
aglt	0.00629	-	0.00602
pred	0.00539	-	0.00540
pact	0.00454	-	0.00452
bedzie	0.00229	-	0.00243
pcon	0.00218	-	0.00189
impt	0.00203	-	0.00226
siebie	0.00177	-	0.00158
imps	0.00174	-	0.00177
interj	0.00131	-	0.00102
xxx	0.00070	-	0.00048
adjp	0.00069	-	0.00065
winien	0.00068	-	0.00057
adja	0.00048	-	0.00058
pant	0.00012	-	0.00018
burk	0.00011	-	0.00006
numcol	0.00011	-	0.00013
depr	0.00010	-	0.00004
adjc	0.00007	-	0.00008
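
Each column of the table sums to roughly 1, so the values appear to be relative frequencies, presumably counted over tokens. Under that assumption, a distribution of this kind could be computed as in the sketch below. The function name `class_distribution` is hypothetical and the input is toy data, not corpus data.

```python
from collections import Counter


def class_distribution(tag_sequences):
    """Return each tag's share of all tokens, most frequent first."""
    counts = Counter(tag for sentence in tag_sequences for tag in sentence)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.most_common()}


# Toy data for illustration only; not taken from the corpus.
toy_tags = [
    ["subst", "fin", "interp"],
    ["adj", "subst", "interp"],
]
distribution = class_distribution(toy_tags)
print(distribution)
```

If the shares are indeed token-level, a baseline that always predicts the most frequent class (`subst`) would reach a token accuracy equal to that class's share, about 0.277 on the test subset.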

Citation

@book{przepiorkowski_narodowy_2012,
title = {Narodowy korpus języka polskiego},
isbn = {978-83-01-16700-4},
language = {pl},
publisher = {Wydawnictwo Naukowe PWN},
editor = {Przepiórkowski, Adam and Bańko, Mirosław and Górski, Rafał L. and Lewandowska-Tomaszczyk, Barbara},
year = {2012}
}

License

GNU GPL v.3

Links

HuggingFace

Source

Paper

Examples

Loading

from pprint import pprint

from datasets import load_dataset

dataset = load_dataset("clarin-pl/nkjp-pos")
pprint(dataset['train'][5000])

# {'id': '130-2-900005_morph_49.49-s',
#  'pos_tags': [16, 4, 3, 30, 12, 18, 3, 16, 14, 6, 14, 26, 1, 30, 12],
#  'tokens': ['Najwyraźniej',
#             'źle',
#             'ocenił',
#             'odległość',
#             ',',
#             'bo',
#             'zderzył',
#             'się',
#             'z',
#             'jadącą',
#             'z',
#             'naprzeciwka',
#             'ciężarową',
#             'scanią',
#             '.']}
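
The `pos_tags` column holds integer class ids rather than tag names. A minimal sketch of mapping them back by list position follows. The helper `decode_tags` and the three-element label list are hypothetical; the real label order has to be read from the dataset features.

```python
def decode_tags(tag_ids, label_names):
    """Map integer class ids to tag names by list position."""
    return [label_names[tag_id] for tag_id in tag_ids]


# Hypothetical label list for illustration; the real order comes from the dataset features.
label_names = ["adj", "interp", "subst"]
print(decode_tags([2, 0, 2, 1], label_names))
```

With the dataset loaded, the names are available as `dataset["train"].features["pos_tags"].feature.names`, which is the same attribute the evaluation code in this card uses.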

Evaluation

import random
from pprint import pprint

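from datasets import load_dataset
# datasets.load_metric was removed in datasets 3.0; evaluate.load is its replacement
from evaluate import load as load_metric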

dataset = load_dataset("clarin-pl/nkjp-pos")
references = dataset["test"]["pos_tags"]

# generate random predictions
predictions = [
    [
        random.randrange(dataset["train"].features["pos_tags"].feature.num_classes)
        for _ in range(len(labels))
    ]
    for labels in references
]

# transform to original names of labels
references_named = [
    [dataset["train"].features["pos_tags"].feature.names[label] for label in labels]
    for labels in references
]
predictions_named = [
    [dataset["train"].features["pos_tags"].feature.names[label] for label in labels]
    for labels in predictions
]

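# transform to BILOU scheme: prefix every tag with U- (single-token unit)
# so that seqeval treats each token as its own one-token entity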
references_named = [
    [f"U-{label}" if label != "O" else label for label in labels]
    for labels in references_named
]
predictions_named = [
    [f"U-{label}" if label != "O" else label for label in labels]
    for labels in predictions_named
]

# utilise seqeval to evaluate
seqeval = load_metric("seqeval")
seqeval_score = seqeval.compute(
    predictions=predictions_named,
    references=references_named,
    scheme="BILOU",
    mode="strict",
)

pprint(seqeval_score, depth=1)

# {'adj': {...},
#  'adja': {...},
#  'adjc': {...},
#  'adjp': {...},
#  'adv': {...},
#  'aglt': {...},
#  'bedzie': {...},
#  'brev': {...},
#  'burk': {...},
#  'comp': {...},
#  'conj': {...},
#  'depr': {...},
#  'fin': {...},
#  'ger': {...},
#  'imps': {...},
#  'impt': {...},
#  'inf': {...},
#  'interj': {...},
#  'interp': {...},
#  'num': {...},
#  'numcol': {...},
#  'overall_accuracy': 0.027855061488566583,
#  'overall_f1': 0.027855061488566583,
#  'overall_precision': 0.027855061488566583,
#  'overall_recall': 0.027855061488566583,
#  'pact': {...},
#  'pant': {...},
#  'pcon': {...},
#  'ppas': {...},
#  'ppron12': {...},
#  'ppron3': {...},
#  'praet': {...},
#  'pred': {...},
#  'prep': {...},
#  'qub': {...},
#  'siebie': {...},
#  'subst': {...},
#  'winien': {...},
#  'xxx': {...}}
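
The four `overall_*` values in the output are identical, and the `U-` prefixing explains why: every token becomes a one-token entity, so the number of predicted entities and the number of reference entities both equal the number of tokens. The sketch below is a simplified re-derivation under that assumption, not seqeval's actual implementation; the function name `micro_prf` and the toy sequences are hypothetical.

```python
def micro_prf(references, predictions):
    """Micro precision/recall/F1 when every token counts as one single-token entity."""
    correct = sum(
        ref == pred
        for ref_seq, pred_seq in zip(references, predictions)
        for ref, pred in zip(ref_seq, pred_seq)
    )
    n_pred = sum(len(seq) for seq in predictions)  # one predicted entity per token
    n_ref = sum(len(seq) for seq in references)    # one reference entity per token
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_ref if n_ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy sequences for illustration only.
refs = [["subst", "fin", "interp"], ["adj", "subst"]]
preds = [["subst", "praet", "interp"], ["adj", "adj"]]
print(micro_prf(refs, preds))
```

Since precision and recall share the same numerator and denominator here, F1 collapses to token accuracy. The reported value of about 0.028 is close to 1/35, which is what uniform random guessing over 35 classes would be expected to score.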

Authors:

clarin-pl

Dataset size:

25.68 MB