Datasets:

strombergnlp/twitter_pos

Sub-tasks:

part-of-speech

Languages:

en

Multilinguality:

monolingual

Size:

10K<n<100K

Language creators:

found

Annotations creators:

expert-generated

Source datasets:

original

License:

cc-by-4.0

Dataset Card for "twitter-pos"

Dataset Summary

Part-of-speech tagging is a basic NLP task. However, Twitter text is difficult to tag: it is noisy, with linguistic errors and idiosyncratic style. This dataset combines two datasets for English PoS tagging of tweets:

  • Ritter, with train/dev/test
  • Foster, with dev/test

The splits are those defined in the Derczynski paper, but the data comes from Ritter and Foster.

Supported Tasks and Leaderboards

Languages

English, non-region-specific. BCP 47: en

Dataset Structure

Data Instances

An example of 'train' looks as follows.

{'id': '0', 'tokens': ['Antick', 'Musings', 'post', ':', 'Book-A-Day', '2010', '#', '243', '(', '10/4', ')', '--', 'Gray', 'Horses', 'by', 'Hope', 'Larson', 'http://bit.ly/as8fvc'], 'pos_tags': [23, 23, 22, 9, 23, 12, 22, 12, 5, 12, 6, 9, 23, 23, 16, 23, 23, 51]}
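Each instance carries two parallel lists, so the tag at position i belongs to the token at position i. A minimal sketch of pairing them, using the instance shown above; the helper name `pair_tokens_with_tags` is hypothetical and not part of the dataset or any library:

```python
# The 'train' instance shown above, copied verbatim.
instance = {
    "id": "0",
    "tokens": ["Antick", "Musings", "post", ":", "Book-A-Day", "2010", "#",
               "243", "(", "10/4", ")", "--", "Gray", "Horses", "by", "Hope",
               "Larson", "http://bit.ly/as8fvc"],
    "pos_tags": [23, 23, 22, 9, 23, 12, 22, 12, 5, 12, 6, 9, 23, 23, 16, 23,
                 23, 51],
}


def pair_tokens_with_tags(example):
    """Return (token, tag_id) pairs, checking that both lists line up."""
    tokens, tags = example["tokens"], example["pos_tags"]
    if len(tokens) != len(tags):
        raise ValueError(
            f"instance {example['id']}: {len(tokens)} tokens but {len(tags)} tags"
        )
    return list(zip(tokens, tags))


pairs = pair_tokens_with_tags(instance)
for token, tag_id in pairs:
    print(f"{tag_id:>3}  {token}")
```

The tag values here are integer class indices, not tag names; they need the dataset's label list to be read as PoS tags.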

Data Fields

The data fields are the same among all splits.

twitter-pos
  • id: a string feature.
  • tokens: a list of string features.
  • pos_tags: a list of classification labels (int). Full tagset with indices:
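The tagset itself is not reproduced in this card. One way to recover the index-to-name mapping is to read it from the loaded dataset's feature metadata. The following is a sketch under assumptions: it relies on the Hugging Face `datasets` library, it assumes that a configuration named "ritter" exists and that `pos_tags` is stored as a sequence of `ClassLabel` values, and it needs network access. The function names are hypothetical.

```python
def load_tag_names(config="ritter"):
    """Fetch the list of tag names; position in the list is the tag index.

    Assumption: the config name and the Sequence(ClassLabel) layout are
    guesses based on this card, not confirmed by it.
    """
    from datasets import load_dataset  # imported lazily; requires network

    ds = load_dataset("strombergnlp/twitter_pos", config, split="train")
    return ds.features["pos_tags"].feature.names


def decode_tags(tag_ids, names):
    """Map integer tag indices to their string names."""
    return [names[i] for i in tag_ids]


# Usage (not run here because it downloads data):
#   names = load_tag_names()
#   decode_tags([23, 23, 22], names)
```

Keeping `decode_tags` separate from the loading step means the mapping can be fetched once and reused across splits.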

Data Splits

name          tokens  sentences
ritter train   10652        551
ritter dev      2242        118
ritter test     2291        118
foster dev      2998        270
foster test     2841        250
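The counts in the table above can be aggregated to compare the two sources, for example by total size and by average sentence length. A small sketch using only the figures from the table; the variable and function names are illustrative:

```python
# (source, split) -> (tokens, sentences), copied from the table above.
SPLITS = {
    ("ritter", "train"): (10652, 551),
    ("ritter", "dev"): (2242, 118),
    ("ritter", "test"): (2291, 118),
    ("foster", "dev"): (2998, 270),
    ("foster", "test"): (2841, 250),
}


def totals_by_source(splits):
    """Sum tokens and sentences per source dataset."""
    totals = {}
    for (source, _split), (tokens, sentences) in splits.items():
        tok, sent = totals.get(source, (0, 0))
        totals[source] = (tok + tokens, sent + sentences)
    return totals


totals = totals_by_source(SPLITS)
for source, (tokens, sentences) in totals.items():
    mean_len = tokens / sentences
    print(f"{source}: {tokens} tokens, {sentences} sentences, "
          f"{mean_len:.1f} tokens per sentence")
```

Averaging per source rather than over all splits keeps the two collections separate, which matters because their sentence lengths differ noticeably.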

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

Citation Information

@inproceedings{ritter2011named,
  title={Named entity recognition in tweets: an experimental study},
  author={Ritter, Alan and Clark, Sam and Etzioni, Oren and others},
  booktitle={Proceedings of the 2011 conference on empirical methods in natural language processing},
  pages={1524--1534},
  year={2011}
}

@inproceedings{foster2011hardtoparse,
  title={\# hardtoparse: POS Tagging and Parsing the Twitterverse},
  author={Foster, Jennifer and Cetinoglu, Ozlem and Wagner, Joachim and Le Roux, Joseph and Hogan, Stephen and Nivre, Joakim and Hogan, Deirdre and Van Genabith, Josef},
  booktitle={Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence},
  year={2011}
}

@inproceedings{derczynski2013twitter,
  title={Twitter part-of-speech tagging for all: Overcoming sparse and noisy data},
  author={Derczynski, Leon and Ritter, Alan and Clark, Sam and Bontcheva, Kalina},
  booktitle={Proceedings of the international conference recent advances in natural language processing ranlp 2013},
  pages={198--206},
  year={2013}
}

Contributions

Uploaded by the author (@leondz)