Dataset:

cjvt/janes_tag

Dataset Card for Janes-Tag

Dataset Summary

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC), consisting mostly of tweets, along with blogs, forums, and news comments.

Languages

Code-switched/nonstandard Slovenian.

Dataset Structure

Data Instances

A sample instance from the dataset. Each word is annotated with its form (word), lemma, MSD tag (XPOS), and IOB2-encoded named entity tag.

{
  'id': 'janes.news.rtvslo.279732.2',
  'words': ['Jst', 'mam', 'tud', 'dons', 'rojstn', 'dan', '.'],
  'lemmas': ['jaz', 'imeti', 'tudi', 'danes', 'rojsten', 'dan', '.'],
  'msds': ['mte:Pp1-sn', 'mte:Vmpr1s-n', 'mte:Q', 'mte:Rgp', 'mte:Agpmsay', 'mte:Ncmsan', 'mte:Z'],
  'nes': ['O', 'O', 'O', 'O', 'O', 'O', 'O']
}
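
The four annotation fields are parallel lists, so position i in each list describes the same token. The following is a minimal sketch of regrouping them into one record per token, using the sample instance above. The helper name `to_tokens` is hypothetical, and stripping the `mte:` prefix is only an assumption based on how the MSD tags look in this sample.

```python
# The sample instance shown above, copied verbatim.
sample = {
    "id": "janes.news.rtvslo.279732.2",
    "words": ["Jst", "mam", "tud", "dons", "rojstn", "dan", "."],
    "lemmas": ["jaz", "imeti", "tudi", "danes", "rojsten", "dan", "."],
    "msds": ["mte:Pp1-sn", "mte:Vmpr1s-n", "mte:Q", "mte:Rgp",
             "mte:Agpmsay", "mte:Ncmsan", "mte:Z"],
    "nes": ["O", "O", "O", "O", "O", "O", "O"],
}

FIELDS = ("words", "lemmas", "msds", "nes")


def to_tokens(example):
    """Regroup the parallel annotation lists into one dict per token.

    Hypothetical helper, not part of the dataset or any library.
    """
    lengths = {len(example[field]) for field in FIELDS}
    if len(lengths) != 1:
        raise ValueError(f"annotation lists differ in length: {sorted(lengths)}")
    return [
        {
            "word": word,
            "lemma": lemma,
            # Assumption: the part before the first ':' is a tagset prefix.
            "msd": msd.split(":", 1)[-1],
            "ne": ne,
        }
        for word, lemma, msd, ne in zip(*(example[field] for field in FIELDS))
    ]


tokens = to_tokens(sample)
for token in tokens:
    print(token["word"], token["lemma"], token["msd"], token["ne"], sep="\t")
```

The length check is there because `zip` silently truncates to the shortest list, which would hide a misaligned example.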

Data Fields

  • id : unique identifier of the example;
  • words : words in the example;
  • lemmas : lemmas in the example;
  • msds : morphosyntactic description (MSD) tags of the words in the example;
  • nes : IOB2-encoded named entity tags of the words in the example (person, location, organization, misc, other).
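
In IOB2 encoding, a `B-` tag opens an entity, `I-` tags continue it, and `O` marks tokens outside any entity. The sketch below shows one way to turn the `words` and `nes` lists into entity spans. The function name `iob2_spans` is hypothetical, and the label strings `B-per`, `I-per`, and `B-loc` as well as the example sentence are assumptions made for illustration, since the card does not list the exact tag strings.

```python
def iob2_spans(words, tags):
    """Collect (label, start, end, text) spans from IOB2 tags.

    `end` is exclusive. An `I-` tag that does not continue the open
    span is treated leniently as the start of a new span.
    """
    spans = []
    start, label = None, None
    # A trailing "O" sentinel closes a span that runs to the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((label, start, i, " ".join(words[start:i])))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


# Hypothetical sentence and labels, not taken from the corpus.
demo_words = ["Janez", "Novak", "živi", "v", "Ljubljani", "."]
demo_tags = ["B-per", "I-per", "O", "O", "B-loc", "O"]
demo_spans = iob2_spans(demo_words, demo_tags)
print(demo_spans)
```

For the sample instance shown earlier, where every tag is `O`, the function returns an empty list.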

Additional Information

Dataset Curators

Jakob Lenardič et al. (please see http://hdl.handle.net/11356/1732 for the full list)

Licensing Information

CC BY-SA 4.0.

Citation Information

@misc{janes_tag,
  title = {{CMC} training corpus Janes-Tag 3.0},
  author = {Lenardi{\v c}, Jakob and {\v C}ibej, Jaka and Arhar Holdt, {\v S}pela and Erjavec, Toma{\v z} and Fi{\v s}er, Darja and Ljube{\v s}i{\'c}, Nikola and Zupan, Katja and Dobrovoljc, Kaja},
  url = {http://hdl.handle.net/11356/1732},
  note = {Slovenian language resource repository {CLARIN}.{SI}},
  copyright = {Creative Commons - Attribution-{ShareAlike} 4.0 International ({CC} {BY}-{SA} 4.0)},
  year = {2022}
}

Contributions

Thanks to @matejklemen for adding this dataset.