Dataset:

strombergnlp/twitter_pos

Tasks:

Token classification

Sub-tasks:

part-of-speech

Languages:

English

Multilinguality:

monolingual

Size:

10K<n<100K

Language creators:

found

Annotation creators:

expert-generated

Source datasets:

original

License:

cc-by-4.0

Dataset Card for "twitter-pos"

Dataset Summary

Part-of-speech tagging is a basic NLP task. However, Twitter text is difficult to part-of-speech tag: it is noisy, with linguistic errors and idiosyncratic style. This dataset combines two datasets for English PoS tagging of tweets:

Ritter, with train/dev/test
Foster, with dev/test

The splits are those defined in the Derczynski paper; the data itself comes from Ritter and Foster.

Ritter: https://aclanthology.org/D11-1141.pdf
Foster: https://www.aaai.org/ocs/index.php/ws/aaaiw11/paper/download/3912/4191

Supported Tasks and Leaderboards

Part-of-speech tagging on Ritter

Languages

English, non-region-specific. BCP 47: en

Dataset Structure

Data Instances

An example of 'train' looks as follows.

{'id': '0', 'tokens': ['Antick', 'Musings', 'post', ':', 'Book-A-Day', '2010', '#', '243', '(', '10/4', ')', '--', 'Gray', 'Horses', 'by', 'Hope', 'Larson', 'http://bit.ly/as8fvc'], 'pos_tags': [23, 23, 22, 9, 23, 12, 22, 12, 5, 12, 6, 9, 23, 23, 16, 23, 23, 51]}
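As a minimal sketch of how such an instance can be consumed, the snippet below pairs each token with its integer tag and counts tag frequencies. The record is copied from the example above; the helper name `pair_tokens_with_tags` is hypothetical and not part of any library, and the tags are kept as raw integers because the tag names are not listed here.

```python
from collections import Counter

# The 'train' instance quoted above, copied verbatim.
example = {
    "id": "0",
    "tokens": ["Antick", "Musings", "post", ":", "Book-A-Day", "2010", "#",
               "243", "(", "10/4", ")", "--", "Gray", "Horses", "by", "Hope",
               "Larson", "http://bit.ly/as8fvc"],
    "pos_tags": [23, 23, 22, 9, 23, 12, 22, 12, 5, 12, 6, 9, 23, 23, 16, 23,
                 23, 51],
}


def pair_tokens_with_tags(instance):
    """Hypothetical helper: return (token, tag_id) pairs for one instance.

    Raises ValueError if the two lists are not aligned one-to-one.
    """
    tokens = instance["tokens"]
    tags = instance["pos_tags"]
    if len(tokens) != len(tags):
        raise ValueError(
            f"instance {instance['id']}: {len(tokens)} tokens but {len(tags)} tags"
        )
    return list(zip(tokens, tags))


pairs = pair_tokens_with_tags(example)

# Frequency of each integer tag id within this single instance.
tag_counts = Counter(tag for _, tag in pairs)

for token, tag in pairs:
    print(f"{token}\t{tag}")
```

The length check is there because token classification assumes exactly one label per token; a mismatch usually points to a tokenization or preprocessing problem rather than a property of the data.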

Data Fields

The data fields are the same across all splits.

twitter-pos

id: a string feature.
tokens: a list of string features.
pos_tags: a list of classification labels (int). Full tagset with indices:

Data Splits

name	tokens	sentences
ritter train	10652	551
ritter dev	2242	118
ritter test	2291	118
foster dev	2998	270
foster test	2841	250
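To make the split sizes easier to compare, the following sketch computes the average number of tokens per sentence for each split, plus the overall totals. The counts are copied from the table above; the names `SPLITS` and `tokens_per_sentence` are illustrative only.

```python
# split name -> (token count, sentence count), copied from the table above.
SPLITS = {
    "ritter train": (10652, 551),
    "ritter dev": (2242, 118),
    "ritter test": (2291, 118),
    "foster dev": (2998, 270),
    "foster test": (2841, 250),
}


def tokens_per_sentence(tokens, sentences):
    """Average sentence length in tokens."""
    return tokens / sentences


averages = {
    name: tokens_per_sentence(tokens, sentences)
    for name, (tokens, sentences) in SPLITS.items()
}

total_tokens = sum(tokens for tokens, _ in SPLITS.values())
total_sentences = sum(sentences for _, sentences in SPLITS.values())

for name, avg in averages.items():
    print(f"{name}\t{avg:.2f}")
print(f"total\t{total_tokens} tokens\t{total_sentences} sentences")
```

On these figures the Ritter splits average roughly 19 tokens per sentence and the Foster splits roughly 11, so per-sentence metrics are not directly comparable across the two sources.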

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Additional Information

Dataset Curators

More Information Needed

Licensing Information

Citation Information

@inproceedings{ritter2011named,
  title={Named entity recognition in tweets: an experimental study},
  author={Ritter, Alan and Clark, Sam and Etzioni, Oren and others},
  booktitle={Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing},
  pages={1524--1534},
  year={2011}
}

@inproceedings{foster2011hardtoparse,
  title={\#hardtoparse: POS Tagging and Parsing the Twitterverse},
  author={Foster, Jennifer and Cetinoglu, Ozlem and Wagner, Joachim and Le Roux, Joseph and Hogan, Stephen and Nivre, Joakim and Hogan, Deirdre and Van Genabith, Josef},
  booktitle={Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence},
  year={2011}
}

@inproceedings{derczynski2013twitter,
  title={Twitter part-of-speech tagging for all: Overcoming sparse and noisy data},
  author={Derczynski, Leon and Ritter, Alan and Clark, Sam and Bontcheva, Kalina},
  booktitle={Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013},
  pages={198--206},
  year={2013}
}

Contributions

Author uploaded (@leondz)

Author:

strombergnlp

Dataset size:

21.25 KB
