
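The hr500k training corpus contains 506,457 Croatian tokens, manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, named entities, and dependency syntax.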

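At the sentence level, the dataset contains 20,159 training samples, 1,963 validation samples, and 2,672 test samples. Each sample represents one sentence and includes the following features: sentence ID ('sent_id'), sentence text ('text'), list of tokens ('tokens'), list of lemmas ('lemmas'), list of MULTEXT-East tags ('xpos_tags'), list of UPOS tags ('upos_tags'), list of morphological features ('feats'), and list of IOB tags ('iob_tags'). A subset of the data also contains Universal Dependencies annotations ('ud'); this subset consists of 7,498 training samples, 649 validation samples, and 742 test samples.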
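The per-sentence layout described above can be pictured as a set of parallel lists, one entry per token. The following is a minimal sketch, not code shipped with the dataset: the sample sentence and its values are hypothetical illustrations, the lowercase entity types (`per`, `loc`) are an assumption about the tag inventory, and `check_alignment` and `iob_to_spans` are helper names introduced here.

```python
from typing import Dict, List, Tuple

# Hypothetical sample that follows the feature names listed above.
# The values are illustrative only and are not taken from the corpus.
sample: Dict[str, object] = {
    "sent_id": "example-1",
    "text": "Ivan živi u Zagrebu.",
    "tokens": ["Ivan", "živi", "u", "Zagrebu", "."],
    "lemmas": ["Ivan", "živjeti", "u", "Zagreb", "."],
    "upos_tags": ["PROPN", "VERB", "ADP", "PROPN", "PUNCT"],
    "iob_tags": ["B-per", "O", "O", "B-loc", "O"],
}

# Features that are expected to hold exactly one value per token.
PER_TOKEN_FEATURES = ("lemmas", "xpos_tags", "upos_tags", "feats", "iob_tags")


def check_alignment(sample: Dict[str, object]) -> None:
    """Raise ValueError if a per-token list differs in length from 'tokens'."""
    n_tokens = len(sample["tokens"])
    for name in PER_TOKEN_FEATURES:
        if name in sample and len(sample[name]) != n_tokens:
            raise ValueError(f"{name!r} has {len(sample[name])} items, expected {n_tokens}")


def iob_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (entity text, entity type) pairs.

    An 'I-' tag that does not continue an open entity of the same type
    is treated like 'O' in this sketch.
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the open one first.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            current_tokens.append(token)
        else:
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


check_alignment(sample)
entities = iob_to_spans(sample["tokens"], sample["iob_tags"])
```

For the hypothetical sample, `entities` holds two single-token entities, one of type `per` and one of type `loc`.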

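Three dataset configurations are available: 'ner', 'upos', and 'ud'. In each configuration, the corresponding features are encoded as class labels. If no configuration is specified, 'ner' is used by default.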
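To make the configuration default and the class-label encoding concrete, here is a minimal sketch in plain Python. It does not call any dataset-loading library; `resolve_config`, `encode`, and `decode` are hypothetical helpers, and `LABEL_NAMES` is an abbreviated, assumed label inventory whose contents and order may differ from the real configurations.

```python
from typing import List, Optional

# Configuration names as described above; 'ner' is the documented default.
VALID_CONFIGS = ("ner", "upos", "ud")
DEFAULT_CONFIG = "ner"

# Hypothetical, abbreviated label inventory for illustration only.
LABEL_NAMES = ["O", "B-per", "I-per", "B-loc", "I-loc"]
LABEL_TO_ID = {name: index for index, name in enumerate(LABEL_NAMES)}


def resolve_config(name: Optional[str] = None) -> str:
    """Return the configuration to use, falling back to the default."""
    if name is None:
        return DEFAULT_CONFIG
    if name not in VALID_CONFIGS:
        raise ValueError(f"unknown configuration {name!r}; expected one of {VALID_CONFIGS}")
    return name


def encode(tags: List[str]) -> List[int]:
    """Map string tags to integer class ids."""
    return [LABEL_TO_ID[tag] for tag in tags]


def decode(ids: List[int]) -> List[str]:
    """Map integer class ids back to string tags."""
    return [LABEL_NAMES[i] for i in ids]


config = resolve_config()                 # no configuration given
ids = encode(["B-per", "O", "B-loc"])     # tags become integer ids
tags = decode(ids)                        # and can be mapped back
```

Keeping the name list and the id mapping in one place means that predictions from a model trained on the integer ids can be converted back to readable tags with the same table.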

If you use this dataset in your research, please cite the following paper:

@InProceedings{LJUBEI16.340,
  author = {Nikola Ljubešić and Filip Klubička and Željko Agić and Ivo-Pavao Jazbec},
  title = {New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian},
  booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  year = {2016},
  month = {may},
  date = {23-28},
  location = {Portorož, Slovenia},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Sara Goggi and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Helene Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {978-2-9517408-9-1},
  language = {english}
 }