Dataset:
Emanuel/UD_Portuguese-Bosque
Language:
pt
This dataset has been automatically processed by AutoNLP for the pos-tag-bosque project.
The BCP-47 code for the dataset's language is pt.
Samples from this dataset look as follows:
[ { "tags": [ 5, 7, 0 ], "tokens": [ "Um", "revivalismo", "refrescante" ] }, { "tags": [ 5, 11, 11, 11, 3, 5, 7, 1, 5, 7, 0, 12 ], "tokens": [ "O", "7", "e", "Meio", "\u00e9", "um", "ex-libris", "de", "a", "noite", "algarvia", "." ] } ]
The dataset has the following fields (also called "features"):
{ "tags": "Sequence(feature=ClassLabel(num_classes=17, names=['ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X'], names_file=None, id=None), length=-1, id=None)", "tokens": "Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)" }
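The integer values in `tags` are indices into the `names` list of the `ClassLabel` feature. A minimal sketch of decoding them by hand is shown here; the `TAG_NAMES` constant and the `decode_tags` helper are illustrative names, not part of the dataset or of any library, and the label list is copied from the field definition.

```python
# Label names copied from the ClassLabel definition of the "tags" field.
# The position of each name in the list is its integer tag ID.
TAG_NAMES = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
]


def decode_tags(tag_ids):
    """Map integer tag IDs to their label names (hypothetical helper)."""
    return [TAG_NAMES[i] for i in tag_ids]


# First sample from the dataset card.
sample = {"tags": [5, 7, 0], "tokens": ["Um", "revivalismo", "refrescante"]}

# Pair every token with its decoded part-of-speech label.
pairs = list(zip(sample["tokens"], decode_tags(sample["tags"])))
print(pairs)
# [('Um', 'DET'), ('revivalismo', 'NOUN'), ('refrescante', 'ADJ')]
```

If the dataset is loaded through the `datasets` library, the `ClassLabel` feature should offer the same mapping directly, so a hand-written list like this is only needed when working with the raw JSON.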
This dataset is split into a training set and a validation set. The split sizes are as follows:
Split name | Num samples |
---|---|
train | 8328 |
valid | 476 |
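
As a quick check on the split sizes, the share of samples held out for validation can be computed from the table. This is a small sketch; the `split_sizes` dictionary simply restates the numbers above and is not an object provided by the dataset.

```python
# Split sizes restated from the table above.
split_sizes = {"train": 8328, "valid": 476}

# Total number of samples across both splits.
total = sum(split_sizes.values())

# Fraction of all samples that belongs to each split.
fractions = {name: size / total for name, size in split_sizes.items()}

for name, fraction in fractions.items():
    print(f"{name}: {split_sizes[name]} samples ({fraction:.1%})")
```

The validation split is roughly 5% of the data, so per-tag metrics for rare labels may be noisy.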