Dataset: cjvt/janes_preklop

Dataset Card for Janes-Preklop

Dataset Summary

Janes-Preklop is a corpus of Slovene tweets that is manually annotated for code-switching: the use of words from two or more languages within one sentence or utterance.

Languages

Code-switched Slovenian.

Dataset Structure

Data Instances

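A sample instance from the dataset. Each word is annotated with its language: "default" (Slovenian or unclassifiable), en (English), de (German), hbs (Serbo-Croatian), sp (Spanish), la (Latin), ar (Arabic), fr (French), it (Italian), or pt (Portuguese). As the example shows, non-default labels carry a B- or I- prefix: B- marks the first word of a code-switched span and I- marks the words that continue it.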

{
    'id': 'tid.397447931558895616', 
    'words': ['Brad', 'Pitt', 'na', 'Planet', 'TV', '.', 'U', 'are', 'welcome', ';)'], 
    'language': ['default', 'default', 'default', 'default', 'default', 'default', 'B-en', 'I-en', 'I-en', 'I-en']
}

Data Fields

  • id : unique identifier of the example;
  • words : words in the sentence;
  • language : language of each word.
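
The words and language fields are parallel lists, so code-switched spans can be recovered by walking them together. The following is a minimal sketch, not part of the dataset tooling: the helper name extract_switched_spans is hypothetical, and it assumes the B-/I- prefix convention seen in the sample instance holds for the whole corpus.

```python
from typing import List, Tuple


def extract_switched_spans(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group consecutive B-/I- tagged words into (language, text) spans.

    Assumption: tags are either "default" or "<B|I>-<lang>", as in the
    sample instance on this card.
    """
    spans: List[Tuple[str, str]] = []
    current_lang = None
    current_words: List[str] = []

    for word, tag in zip(words, tags):
        prefix, sep, lang = tag.partition("-")

        # An I- tag in the same language extends the span that is open.
        if sep and prefix == "I" and lang == current_lang:
            current_words.append(word)
            continue

        # Anything else closes the open span, if there is one.
        if current_lang is not None:
            spans.append((current_lang, " ".join(current_words)))
            current_lang, current_words = None, []

        # A B- tag (or a stray I- tag) opens a new span; "default" does not.
        if sep and prefix in ("B", "I"):
            current_lang, current_words = lang, [word]

    # Flush a span that runs to the end of the sentence.
    if current_lang is not None:
        spans.append((current_lang, " ".join(current_words)))

    return spans


# The sample instance from this card.
sample = {
    "id": "tid.397447931558895616",
    "words": ["Brad", "Pitt", "na", "Planet", "TV", ".", "U", "are", "welcome", ";)"],
    "language": ["default"] * 6 + ["B-en", "I-en", "I-en", "I-en"],
}

spans = extract_switched_spans(sample["words"], sample["language"])
print(spans)  # [('en', 'U are welcome ;)')]
```

Words are joined with single spaces, so the span text reflects the tokenization and not necessarily the original spacing of the tweet.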

Additional Information

Dataset Curators

Špela Reher, Tomaž Erjavec, Darja Fišer.

Licensing Information

CC BY-SA 4.0.

Citation Information

@misc{janes_preklop,
  title = {Tweet code-switching corpus Janes-Preklop 1.0},
  author = {Reher, {\v S}pela and Erjavec, Toma{\v z} and Fi{\v s}er, Darja},
  url = {http://hdl.handle.net/11356/1154},
  note = {Slovenian language resource repository {CLARIN}.{SI}},
  copyright = {Creative Commons - Attribution-{ShareAlike} 4.0 International ({CC} {BY}-{SA} 4.0)},
  issn = {2820-4042},
  year = {2017}
}

Contributions

Thanks to @matejklemen for adding this dataset.