Datasets:

nkjp-ner

Languages:

pl

Multilinguality:

monolingual

Size:

10K<n<100K

Language creators:

other

Annotation creators:

expert-generated

Source datasets:

original

License:

gpl-3.0

Dataset Card for NKJP NER

Dataset Summary

A linguistic corpus is a collection of texts in which one can find the typical uses of a word or phrase, together with its meaning and grammatical function. Without access to a language corpus it is now practically impossible to do linguistic research, to write dictionaries, grammars, and language teaching books, or to build search engines sensitive to Polish inflection, machine translation engines, and other advanced language technology. Language corpora have become an essential tool for linguists, and they are also helpful for software engineers, scholars of literature and culture, historians, librarians, and other specialists in the arts and computer science. The manually annotated 1-million-word subcorpus of the NKJP is available under GNU GPL v.3.

Supported Tasks and Leaderboards

Named entity recognition

[More Information Needed]

Languages

Polish

Dataset Structure

Data Instances

The data comes as three TSV files: train and dev each have two columns (sentence, target), while test has a single column (sentence).

Data Fields

  • sentence
  • target
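
The file layout described above can be read with Python's standard csv module. The following is a minimal sketch, not the dataset's official loader: the helper name read_split, the presence of a header row, and the sample sentences and labels (LABEL_A, LABEL_B) are all assumptions made for illustration and do not come from the dataset.

```python
import csv
import io


def read_split(handle, has_target=True):
    """Parse one TSV split into a list of dicts keyed by column name.

    Assumption: the first row is a header naming the columns
    ("sentence" and, for train/dev, "target").
    """
    # QUOTE_NONE keeps quotation marks inside sentences as ordinary text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    expected = ["sentence", "target"] if has_target else ["sentence"]
    if header != expected:
        raise ValueError(f"unexpected columns: {header!r}")
    # Skip blank lines, which csv.reader yields as empty lists.
    return [dict(zip(header, row)) for row in reader if row]


# Made-up rows standing in for a train/dev file (labels are placeholders).
labelled_sample = (
    "sentence\ttarget\n"
    "Przykładowe zdanie.\tLABEL_A\n"
    "Inne zdanie.\tLABEL_B\n"
)
# Made-up rows standing in for the test file, which has no target column.
unlabelled_sample = "sentence\nJeszcze jedno zdanie.\n"

train_rows = read_split(io.StringIO(labelled_sample))
test_rows = read_split(io.StringIO(unlabelled_sample), has_target=False)
```

For files on disk, the same helper can be given a handle opened with open(path, encoding="utf-8", newline=""); the actual file names are not stated in this card.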

Data Splits

The data is split into train, dev, and test sets.

Dataset Creation

Curation Rationale

This dataset is one of nine evaluation tasks intended to improve Polish language processing.

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

GNU GPL v.3

Citation Information

@book{przepiorkowski2012narodowy, title={Narodowy korpus j{\k{e}}zyka polskiego}, author={Przepi{\'o}rkowski, Adam}, year={2012}, publisher={Naukowe PWN} }

Contributions

Thanks to @abecadel for adding this dataset.