Dataset:

strombergnlp/twitter_pos

Tasks:

Token Classification

Sub-tasks:

part-of-speech

Languages:

English

Multilinguality:

monolingual

Size:

10K<n<100K

Language creators:

found

Annotations creators:

expert-generated

Source datasets:

original

License:

cc-by-4.0

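Dataset Card for "twitter-pos"

Dataset Summary

Part-of-speech tagging is a basic natural language processing task. However, Twitter text is difficult to tag for part of speech: it is noisy, with linguistic errors and idiosyncratic style. This dataset contains two datasets of English tweets annotated with part-of-speech (PoS) tags:

Ritter, with train/dev/test data
Foster, with dev/test data

The splits are defined in the Derczynski paper, but the data comes from Ritter and Foster.

Ritter: https://aclanthology.org/D11-1141.pdf
Foster: https://www.aaai.org/ocs/index.php/ws/aaaiw11/paper/download/3912/4191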

Supported Tasks and Leaderboards

Part of speech tagging on Ritter

Languages

English, non-region-specific. bcp47:en

Dataset Structure

Data Instances

An example of "train" looks as follows.

{'id': '0', 'tokens': ['Antick', 'Musings', 'post', ':', 'Book-A-Day', '2010', '#', '243', '(', '10/4', ')', '--', 'Gray', 'Horses', 'by', 'Hope', 'Larson', 'http://bit.ly/as8fvc'], 'pos_tags': [23, 23, 22, 9, 23, 12, 22, 12, 5, 12, 6, 9, 23, 23, 16, 23, 23, 51]}
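As a rough illustration of how the two parallel lists line up, the sketch below pairs each token of the instance above with its integer tag id. The helper name `pair_tokens_with_tags` is hypothetical and not part of the dataset or any library, and the ids are left as integers because the mapping from ids to tag names is not reproduced in this card.

```python
from typing import Any, Dict, List, Tuple

# The "train" instance shown above, copied verbatim.
example: Dict[str, Any] = {
    "id": "0",
    "tokens": ["Antick", "Musings", "post", ":", "Book-A-Day", "2010", "#",
               "243", "(", "10/4", ")", "--", "Gray", "Horses", "by", "Hope",
               "Larson", "http://bit.ly/as8fvc"],
    "pos_tags": [23, 23, 22, 9, 23, 12, 22, 12, 5, 12, 6, 9, 23, 23, 16, 23,
                 23, 51],
}


def pair_tokens_with_tags(instance: Dict[str, Any]) -> List[Tuple[str, int]]:
    """Hypothetical helper: align each token with its integer tag id."""
    tokens = instance["tokens"]
    tags = instance["pos_tags"]
    # Assumption: tagging is per token, so both lists have the same length.
    if len(tokens) != len(tags):
        raise ValueError("tokens and pos_tags differ in length")
    return list(zip(tokens, tags))


pairs = pair_tokens_with_tags(example)
for token, tag_id in pairs:
    print(f"{tag_id:>2}  {token}")
```

In this instance the capitalised words share id 23 and the URL carries id 51; what those ids mean depends on the tagset, which has to be looked up separately.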

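Data Fields

The data fields are the same among all splits.

twitter-pos

id: a string feature.
tokens: a list of string features.
pos_tags: a list of classification labels (int); each integer is an index into the full tagset.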
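The field description above can be turned into a quick structural check. The following is a minimal sketch only: `field_problems` is a hypothetical helper, not an API of the dataset, and the rule that `tokens` and `pos_tags` must have equal length is an assumption inferred from the sample instance rather than something the card states.

```python
from typing import Any, List


def field_problems(instance: Any) -> List[str]:
    """Hypothetical checker: list ways an instance departs from the described fields."""
    if not isinstance(instance, dict):
        return ["instance is not a dict"]

    problems: List[str] = []
    tokens = instance.get("tokens")
    tags = instance.get("pos_tags")

    if not isinstance(instance.get("id"), str):
        problems.append("id must be a string")
    if not (isinstance(tokens, list) and all(isinstance(t, str) for t in tokens)):
        problems.append("tokens must be a list of strings")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not (isinstance(tags, list)
            and all(isinstance(t, int) and not isinstance(t, bool) for t in tags)):
        problems.append("pos_tags must be a list of ints")
    # Assumption: one tag per token.
    if isinstance(tokens, list) and isinstance(tags, list) and len(tokens) != len(tags):
        problems.append("tokens and pos_tags differ in length")
    return problems


good = {"id": "0", "tokens": ["Gray", "Horses"], "pos_tags": [23, 23]}
bad = {"id": 0, "tokens": ["Gray", "Horses"], "pos_tags": [23]}

print(field_problems(good))
print(field_problems(bad))
```

Returning a list of messages rather than raising on the first failure makes it easier to report every problem in a malformed record at once.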

Data Splits

name	tokens	sentences
ritter train	10652	551
ritter dev	2242	118
ritter test	2291	118
foster dev	2998	270
foster test	2841	250
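To make the table easier to compare across sources, the sketch below derives the average sentence length and the overall totals from the counts listed above. The variable names are illustrative only; the numbers are copied from the table and nothing is loaded from the dataset itself.

```python
# (tokens, sentences) per split, copied from the table above.
splits = {
    "ritter train": (10652, 551),
    "ritter dev": (2242, 118),
    "ritter test": (2291, 118),
    "foster dev": (2998, 270),
    "foster test": (2841, 250),
}

# Average number of tokens per sentence, rounded to two decimals.
averages = {
    name: round(tokens / sentences, 2)
    for name, (tokens, sentences) in splits.items()
}

total_tokens = sum(tokens for tokens, _ in splits.values())
total_sentences = sum(sentences for _, sentences in splits.values())

for name, avg in averages.items():
    print(f"{name:<13} {avg:>6.2f} tokens/sentence")
print(f"total: {total_tokens} tokens in {total_sentences} sentences")
```

By these counts the Ritter splits average roughly 19 tokens per sentence and the Foster splits roughly 11, so length-sensitive metrics may not be directly comparable between the two sources.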

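Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Additional Information

Dataset Curators

More Information Needed

Licensing Information

Citation Information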

@inproceedings{ritter2011named,
  title={Named entity recognition in tweets: an experimental study},
  author={Ritter, Alan and Clark, Sam and Etzioni, Oren and others},
  booktitle={Proceedings of the 2011 conference on empirical methods in natural language processing},
  pages={1524--1534},
  year={2011}
}

@inproceedings{foster2011hardtoparse,
  title={\# hardtoparse: POS Tagging and Parsing the Twitterverse},
  author={Foster, Jennifer and Cetinoglu, Ozlem and Wagner, Joachim and Le Roux, Joseph and Hogan, Stephen and Nivre, Joakim and Hogan, Deirdre and Van Genabith, Josef},
  booktitle={Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence},
  year={2011}
}

@inproceedings{derczynski2013twitter,
  title={Twitter part-of-speech tagging for all: Overcoming sparse and noisy data},
  author={Derczynski, Leon and Ritter, Alan and Clark, Sam and Bontcheva, Kalina},
  booktitle={Proceedings of the international conference recent advances in natural language processing ranlp 2013},
  pages={198--206},
  year={2013}
}

Contributions

Author uploaded (@leondz)

Author:

strombergnlp

Dataset size:

21.25 KB
