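Datasets: strombergnlp/twitter_pos
Tasks: Token Classification
Sub-tasks: part-of-speech
Languages: en
Multilinguality: monolingual
Size Categories: 10K<n<100K
Language Creators: found
Annotations Creators: expert-generated
Source Datasets: original
License: cc-by-4.0

Part-of-speech (PoS) tagging is a basic natural language processing task. Twitter text, however, is difficult to tag: it is noisy, contains linguistic errors, and has an idiosyncratic style. This dataset contains two datasets of English tweets annotated with PoS tags.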
The splits are defined in the Derczynski paper, but the data comes from Ritter and Foster.
English, not region-specific. bcp47: en
An example from "train" is shown below.
{'id': '0', 'tokens': ['Antick', 'Musings', 'post', ':', 'Book-A-Day', '2010', '#', '243', '(', '10/4', ')', '--', 'Gray', 'Horses', 'by', 'Hope', 'Larson', 'http://bit.ly/as8fvc'], 'pos_tags': [23, 23, 22, 9, 23, 12, 22, 12, 5, 12, 6, 9, 23, 23, 16, 23, 23, 51]}
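The record above stores one tag ID per token, so the two lists are meant to be read in parallel. The sketch below is a minimal illustration of that pairing using only the example record; the helper name `pair_tokens_with_tags` is hypothetical, and the tag IDs are left as integers because this section does not list the tag names.

```python
from collections import defaultdict

# Example record copied verbatim from the "train" example above.
record = {
    'id': '0',
    'tokens': ['Antick', 'Musings', 'post', ':', 'Book-A-Day', '2010', '#',
               '243', '(', '10/4', ')', '--', 'Gray', 'Horses', 'by', 'Hope',
               'Larson', 'http://bit.ly/as8fvc'],
    'pos_tags': [23, 23, 22, 9, 23, 12, 22, 12, 5, 12, 6, 9, 23, 23, 16, 23,
                 23, 51],
}


def pair_tokens_with_tags(rec):
    """Return (token, tag_id) pairs, checking that the two lists align."""
    tokens, tags = rec['tokens'], rec['pos_tags']
    if len(tokens) != len(tags):
        raise ValueError(f"{len(tokens)} tokens but {len(tags)} tags")
    return list(zip(tokens, tags))


pairs = pair_tokens_with_tags(record)

# Group tokens by tag ID to see which tokens share a class.
tokens_by_tag = defaultdict(list)
for token, tag_id in pairs:
    tokens_by_tag[tag_id].append(token)

for tag_id in sorted(tokens_by_tag):
    print(tag_id, tokens_by_tag[tag_id])
```

If the dataset is loaded with the Hugging Face `datasets` library, the integer IDs can usually be turned into tag names through the `ClassLabel` feature (for example `features['pos_tags'].feature.int2str(tag_id)`), assuming the `pos_tags` column is declared as a sequence of class labels.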
The data fields are the same across all splits.
twitter-pos

name | tokens | sentences |
---|---|---|
ritter train | 10652 | 551 |
ritter dev | 2242 | 118 |
ritter test | 2291 | 118 |
foster dev | 2998 | 270 |
foster test | 2841 | 250 |
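The split sizes in the table can be combined into a few summary figures. The sketch below is only an arithmetic check on the numbers listed above; the dictionary layout and the function name `mean_tokens_per_sentence` are illustrative choices, not part of the dataset.

```python
# (source, split) -> (tokens, sentences), copied from the table above.
splits = {
    ('ritter', 'train'): (10652, 551),
    ('ritter', 'dev'): (2242, 118),
    ('ritter', 'test'): (2291, 118),
    ('foster', 'dev'): (2998, 270),
    ('foster', 'test'): (2841, 250),
}


def mean_tokens_per_sentence(key):
    """Average sentence length, in tokens, for one (source, split) pair."""
    tokens, sentences = splits[key]
    return tokens / sentences


total_tokens = sum(tokens for tokens, _ in splits.values())
total_sentences = sum(sentences for _, sentences in splits.values())

for source, split in splits:
    mean_len = mean_tokens_per_sentence((source, split))
    print(f"{source} {split}: {mean_len:.1f} tokens per sentence")

print(f"total: {total_tokens} tokens in {total_sentences} sentences")
```

On these figures the Ritter sentences come out noticeably longer on average than the Foster ones, which may matter when comparing tagger accuracy across the two sources.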
Initial Data Collection and Normalization
Who are the source language producers?

@inproceedings{ritter2011named,
  title={Named entity recognition in tweets: an experimental study},
  author={Ritter, Alan and Clark, Sam and Etzioni, Oren and others},
  booktitle={Proceedings of the 2011 conference on empirical methods in natural language processing},
  pages={1524--1534},
  year={2011}
}

@inproceedings{foster2011hardtoparse,
  title={\# hardtoparse: POS Tagging and Parsing the Twitterverse},
  author={Foster, Jennifer and Cetinoglu, Ozlem and Wagner, Joachim and Le Roux, Joseph and Hogan, Stephen and Nivre, Joakim and Hogan, Deirdre and Van Genabith, Josef},
  booktitle={Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence},
  year={2011}
}

@inproceedings{derczynski2013twitter,
  title={Twitter part-of-speech tagging for all: Overcoming sparse and noisy data},
  author={Derczynski, Leon and Ritter, Alan and Clark, Sam and Bontcheva, Kalina},
  booktitle={Proceedings of the international conference recent advances in natural language processing ranlp 2013},
  pages={198--206},
  year={2013}
}

Uploaded by the author (@leondz).