Dataset Card for Dev-Stanford

Dataset Summary

1000 hashtags manually segmented by Çelebi et al. for development purposes, randomly selected from the Stanford Sentiment Tweet Corpus by Sentiment140.

Languages

English

Dataset Structure

Data Instances

{
    "index": 15,
    "hashtag": "marathonmonday",
    "segmentation": "marathon monday"
}

Data Fields

index : a numerical index.
hashtag : the original hashtag.
segmentation : the gold segmentation for the hashtag.
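
As a minimal sketch of how a record with these three fields might be read, the snippet below parses the instance shown above with Python's standard `json` module. The `HashtagRecord` class and the `parse_record` helper are hypothetical names introduced for illustration only; they are not part of the dataset or of the hashformers library.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class HashtagRecord:
    """Hypothetical container mirroring the three documented fields."""
    index: int
    hashtag: str
    segmentation: str


def parse_record(raw: str) -> HashtagRecord:
    """Parse one JSON-encoded instance into a HashtagRecord."""
    data = json.loads(raw)
    return HashtagRecord(
        index=int(data["index"]),
        hashtag=str(data["hashtag"]),
        segmentation=str(data["segmentation"]),
    )


# The instance from the "Data Instances" section, written on one line.
raw = '{"index": 15, "hashtag": "marathonmonday", "segmentation": "marathon monday"}'
record = parse_record(raw)

# Splitting the gold segmentation on whitespace yields the individual words.
words = record.segmentation.split()
print(words)  # ['marathon', 'monday']
```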

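Dataset Creation

All hashtag segmentation and identifier splitting datasets on this profile have the same basic fields: hashtag and segmentation or identifier and segmentation .
The only difference between hashtag and segmentation , or between identifier and segmentation , is the whitespace characters. Spell checking, expanding abbreviations, or correcting characters to uppercase go into other fields.
There is always whitespace between an alphanumeric character and a sequence of any special characters (such as _ , : , ~ ).
If there are any annotations for named entity recognition and other token classification tasks, they are given in a spans field.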
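
The two formatting rules above can be expressed as simple checks. The following is a rough sketch, not an official validator: both function names are hypothetical, and "alphanumeric" is assumed here to mean ASCII letters and digits, which the card itself does not specify.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits; a "special"
# character is anything that is neither alphanumeric nor whitespace.
_UNSPACED_BOUNDARY = re.compile(
    r"[A-Za-z0-9][^A-Za-z0-9\s]|[^A-Za-z0-9\s][A-Za-z0-9]"
)


def differs_only_by_whitespace(source: str, segmentation: str) -> bool:
    """True if removing all whitespace from the segmentation gives the source."""
    return "".join(segmentation.split()) == source


def special_runs_are_spaced(segmentation: str) -> bool:
    """True if no alphanumeric character directly touches a special character."""
    return _UNSPACED_BOUNDARY.search(segmentation) is None


# The instance from this card satisfies both rules.
ok_whitespace = differs_only_by_whitespace("marathonmonday", "marathon monday")
ok_spacing = special_runs_are_spaced("marathon monday")
print(ok_whitespace, ok_spacing)  # True True
```

A check like this could be run over every record before training or evaluation to catch rows where the segmentation changes characters rather than only inserting whitespace.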

Additional Information

Citation Information

@article{celebi2018segmenting,
  title={Segmenting hashtags and analyzing their grammatical structure},
  author={Celebi, Arda and {\"O}zg{\"u}r, Arzucan},
  journal={Journal of the Association for Information Science and Technology},
  volume={69},
  number={5},
  pages={675--686},
  year={2018},
  publisher={Wiley Online Library}
}

Contributions

This dataset was added by @ruanchaves while developing the hashformers library.

Author:

ruanchaves

Dataset size:

5.87 KB