Datasets:

ruanchaves/stan_large

Languages:

en

Multilinguality:

monolingual

Language Creators:

machine-generated

Annotations Creators:

expert-generated

Source Datasets:

original

License:

agpl-3.0

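Dataset Card for STAN Large

Dataset Summary

The description below is taken from the paper "Multi-task Pairwise Neural Ranking for Hashtag Segmentation" by Maddela et al.

"STAN large is our new expert-curated dataset, which includes all 12,594 unique English hashtags and their associated tweets from the same Stanford dataset. STAN small is the most commonly used dataset in previous work. However, after reexamination, we found annotation errors in 6.8% of the hashtags in this dataset, which is significant given that the error rate of the state-of-the-art models is only around 10%. Most of the errors were related to named entities. For example, '#lionhead', which refers to the 'Lionhead' video game company, was labeled as 'lion head'.

We therefore constructed the STAN large dataset of 12,594 hashtags with additional quality control for human annotations."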

Languages

English

Dataset Structure

Data Instances

{
    "index": 15,
    "hashtag": "PokemonPlatinum",
    "segmentation": "Pokemon Platinum",
    "alternatives": {
        "segmentation": [
            "Pokemon platinum"
        ]
    }
}

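Data Fields

  • index: a numerical index.
  • hashtag: the original hashtag.
  • segmentation: the gold segmentation for the hashtag.
  • alternatives: other segmentations that are also accepted as a gold segmentation.

Although segmentation contains exactly the same characters as hashtag apart from the spaces, the segmentations inside alternatives may have characters corrected to uppercase.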
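The field conventions above can be checked on a single record. The following is a minimal sketch, not part of the dataset's own tooling: the helper names `strip_spaces`, `check_record`, and `is_accepted` are hypothetical, and the case-insensitive comparison for alternatives is an assumption based on the statement that alternatives may differ from the hashtag in letter case.

```python
def strip_spaces(text: str) -> str:
    """Remove space characters so two strings can be compared by content."""
    return text.replace(" ", "")


def check_record(record: dict) -> bool:
    """Return True if a record follows the field conventions described above."""
    hashtag = record["hashtag"]
    # segmentation must equal hashtag exactly once the spaces are removed.
    if strip_spaces(record["segmentation"]) != hashtag:
        return False
    # Assumption: alternatives differ from hashtag only in spaces and letter case.
    for alternative in record["alternatives"]["segmentation"]:
        if strip_spaces(alternative).lower() != hashtag.lower():
            return False
    return True


def is_accepted(prediction: str, record: dict) -> bool:
    """Return True if prediction matches the gold segmentation or an alternative."""
    accepted = {record["segmentation"], *record["alternatives"]["segmentation"]}
    return prediction in accepted


# The record shown under Data Instances, written as a Python dict.
record = {
    "index": 15,
    "hashtag": "PokemonPlatinum",
    "segmentation": "Pokemon Platinum",
    "alternatives": {"segmentation": ["Pokemon platinum"]},
}

print(check_record(record))
print(is_accepted("Pokemon platinum", record))
print(is_accepted("Poke mon Platinum", record))
```

Treating alternatives as additional exact-match targets is one possible way to score a segmenter; the paper's own evaluation protocol may differ.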

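Dataset Creation

  • All hashtag segmentation and identifier splitting datasets on this profile have the same basic fields: hashtag and segmentation, or identifier and segmentation.

  • The only difference between hashtag and segmentation, or between identifier and segmentation, is the whitespace characters. Spell checking, expanding abbreviations, and correcting characters to uppercase go into other fields.

  • There is always whitespace between an alphanumeric character and a sequence of any special characters (such as _, :, ~).

  • If there are any annotations for named entity recognition and other token classification tasks, they are given in a spans field.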
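The whitespace rule for special characters can be illustrated with a short sketch. This is not the preprocessing code used to build the dataset: the function names `space_special_characters` and `follows_spacing_rule` are hypothetical, and "alphanumeric" is assumed here to mean ASCII letters and digits.

```python
import re

# Assumption: alphanumeric means ASCII letters and digits; every other
# non-whitespace character counts as a special character.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]+")
ADJACENT_PATTERN = re.compile(
    r"[A-Za-z0-9][^A-Za-z0-9\s]|[^A-Za-z0-9\s][A-Za-z0-9]"
)


def space_special_characters(text: str) -> str:
    """Separate alphanumeric runs from runs of special characters with spaces.

    This does not split words; it only applies the spacing convention.
    """
    return " ".join(TOKEN_PATTERN.findall(text))


def follows_spacing_rule(segmentation: str) -> bool:
    """Return True if no alphanumeric character touches a special character."""
    return ADJACENT_PATTERN.search(segmentation) is None


print(space_special_characters("foo_bar:baz"))
print(follows_spacing_rule("foo _ bar"))
print(follows_spacing_rule("foo_bar"))
```

Note that removing the whitespace from the spaced output recovers the input, which is consistent with whitespace being the only difference between the two fields.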

Additional Information

Citation Information

@inproceedings{maddela-etal-2019-multi,
    title = "Multi-task Pairwise Neural Ranking for Hashtag Segmentation",
    author = "Maddela, Mounica  and
      Xu, Wei  and
      Preo{\c{t}}iuc-Pietro, Daniel",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P19-1242",
    doi = "10.18653/v1/P19-1242",
    pages = "2538--2549",
    abstract = "Hashtags are often employed on social media and beyond to add metadata to a textual utterance with the goal of increasing discoverability, aiding search, or providing additional semantics. However, the semantic content of hashtags is not straightforward to infer as these represent ad-hoc conventions which frequently include multiple words joined together and can include abbreviations and unorthodox spellings. We build a dataset of 12,594 hashtags split into individual segments and propose a set of approaches for hashtag segmentation by framing it as a pairwise ranking problem between candidate segmentations. Our novel neural approaches demonstrate 24.6{\%} error reduction in hashtag segmentation accuracy compared to the current state-of-the-art method. Finally, we demonstrate that a deeper understanding of hashtag semantics obtained through segmentation is useful for downstream applications such as sentiment analysis, for which we achieved a 2.6{\%} increase in average recall on the SemEval 2017 sentiment analysis dataset.",
}

Contributions

This dataset was added by @ruanchaves while developing the hashformers library.