Dataset:

ju-bezdek/conll2003-SK-NER

Dataset Card for conll2003-SK-NER

Dataset Description

This is a translated version of the original CoNLL-2003 dataset, translated from English to Slovak via Google Translate. Annotation was done mostly automatically with word-matching scripts. Records in which some tags could not be matched (10%) were annotated manually. Unlike the original CoNLL-2003 dataset, this one contains only NER tags.

Supported Tasks and Leaderboards

Named entity recognition (NER)

Labels:

  • 0: O
  • 1: B-PER
  • 2: I-PER
  • 3: B-ORG
  • 4: I-ORG
  • 5: B-LOC
  • 6: I-LOC
  • 7: B-MISC
  • 8: I-MISC
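
The label ids above follow the usual BIO scheme, so they can be decoded back into entity spans. The following is a minimal sketch of that decoding, not code shipped with the dataset: the names `ID2LABEL`, `decode_tags`, and `extract_entities` are hypothetical, and the example sentence is made up for illustration.

```python
# Label ids exactly as listed in this card.
ID2LABEL = {
    0: "O",
    1: "B-PER",
    2: "I-PER",
    3: "B-ORG",
    4: "I-ORG",
    5: "B-LOC",
    6: "I-LOC",
    7: "B-MISC",
    8: "I-MISC",
}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def decode_tags(tag_ids):
    """Convert a sequence of integer tag ids to their string labels."""
    return [ID2LABEL[i] for i in tag_ids]


def extract_entities(tokens, tag_ids):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Emit the entity collected so far, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, label in zip(tokens, decode_tags(tag_ids)):
        if label == "O":
            flush()
            current_tokens, current_type = [], None
        elif label.startswith("B-") or label[2:] != current_type:
            # A B- tag, or an I- tag of a different type, starts a new entity.
            flush()
            current_tokens, current_type = [token], label[2:]
        else:
            # An I- tag continuing the current entity.
            current_tokens.append(token)
    flush()
    return entities


# Hypothetical example sentence (not taken from the dataset).
tokens = ["Peter", "Novák", "pracuje", "v", "Bratislave"]
tag_ids = [1, 2, 0, 0, 5]
entities = extract_entities(tokens, tag_ids)
```

Treating a stray `I-` tag of a new type as the start of an entity is a design choice made here for robustness against automatically produced annotations; stricter decoders may reject such sequences instead.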

Languages

Slovak (sk)

Dataset Structure

Data Splits

train, test, val

Dataset Creation

Source Data

https://huggingface.co/datasets/conll2003

Annotations

Annotation process
  • Machine translation
  • Automatic pairing of tags using reverse translation and hardcoded rules (including phrase regex matching, etc.)
  • Manual annotation of records that couldn't be automatically matched
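
To illustrate the general idea of pairing tags by word matching, here is a simplified sketch. It is not the author's script: the function `project_tags`, its inputs, and the example sentence are assumptions made for illustration, and it uses only exact, case-insensitive phrase matching rather than reverse translation or regex rules. Entities that cannot be located are returned separately, mirroring the records that were left for manual annotation.

```python
def project_tags(tokens, entities):
    """Assign BIO tags to translated tokens by exact phrase matching.

    tokens:   list of tokens of the translated sentence
    entities: list of (phrase, entity_type) pairs expected in the sentence
    Returns (tags, unmatched), where unmatched lists the entities that
    could not be found and would need manual annotation.
    """
    tags = ["O"] * len(tokens)
    unmatched = []
    lowered = [t.lower() for t in tokens]

    for phrase, ent_type in entities:
        words = phrase.lower().split()
        if not words:
            continue  # ignore empty phrases
        n = len(words)
        found = False
        for start in range(len(tokens) - n + 1):
            span_is_free = all(tag == "O" for tag in tags[start:start + n])
            if span_is_free and lowered[start:start + n] == words:
                tags[start] = f"B-{ent_type}"
                for k in range(start + 1, start + n):
                    tags[k] = f"I-{ent_type}"
                found = True
                break
        if not found:
            unmatched.append((phrase, ent_type))

    return tags, unmatched


# Hypothetical example: "Germany" was translated to "Nemecko", so the
# untranslated English phrase cannot be matched and is flagged.
tokens = ["Peter", "Novák", "navštívil", "Nemecko"]
entities = [("Peter Novák", "PER"), ("Germany", "LOC")]
tags, unmatched = project_tags(tokens, entities)
```

In this example the person name survives translation unchanged and is tagged automatically, while the location does not, which is the kind of case where reverse translation, extra rules, or manual annotation would be needed.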