Dataset:

ruanchaves/snap

Language:

en

Multilinguality:

monolingual

Language creators:

machine-generated

Annotation creators:

expert-generated

Source datasets:

original

Dataset Card for SNAP

Dataset Summary

803K hashtags from the SNAP Twitter Data Set, automatically segmented with the heuristic described in the paper "Segmenting hashtags using automatically created training data".

Languages

English

Dataset Structure

Data Instances

{
    "index": 0,
    "hashtag": "BrandThunder",
    "segmentation": "Brand Thunder"
}

Data Fields

  • index: a numerical index.
  • hashtag: the original hashtag.
  • segmentation: the gold segmentation for the hashtag.
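
As a minimal sketch of how these three fields relate, the snippet below parses the sample instance shown above and recovers the character offsets at which the segmentation inserts a space into the hashtag. The names `SnapRecord` and `boundary_offsets` are hypothetical helpers introduced for illustration; they are not part of the dataset or of any library.

```python
import json
from typing import List, TypedDict


class SnapRecord(TypedDict):
    """Shape of one record, following the fields listed in this card."""
    index: int
    hashtag: str
    segmentation: str


def boundary_offsets(record: SnapRecord) -> List[int]:
    """Return the offsets into `hashtag` where `segmentation` inserts whitespace.

    Hypothetical helper: it walks the segmentation and counts only the
    non-whitespace characters, so each offset refers to a position in the
    unsegmented hashtag.
    """
    offsets: List[int] = []
    position = 0
    for char in record["segmentation"]:
        if char.isspace():
            offsets.append(position)
        else:
            position += 1
    return offsets


# The sample instance from this card.
record: SnapRecord = json.loads(
    '{"index": 0, "hashtag": "BrandThunder", "segmentation": "Brand Thunder"}'
)
print(boundary_offsets(record))  # [5]
```

Offsets of this kind are one convenient way to turn a segmentation into character-level labels, since they depend only on the hashtag string and the position of each split.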

Dataset Creation

  • All hashtag segmentation and identifier splitting datasets on this profile have the same basic fields: hashtag and segmentation, or identifier and segmentation.

  • The only difference between hashtag and segmentation, or between identifier and segmentation, is the whitespace characters. Spell checking, abbreviation expansion, and corrections of characters to uppercase go into other fields.

  • There is always whitespace between an alphanumeric character and a sequence of special characters (such as _, :, ~).

  • If there are any annotations for named entity recognition and other token classification tasks, they are given in a spans field.
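
The conventions above can be checked mechanically. The following is a rough sketch, not the author's validation code: both function names are hypothetical, and the definition of a "special character" as any character that is neither an ASCII letter or digit nor whitespace is an assumption made for illustration.

```python
import re


def differs_only_by_whitespace(hashtag: str, segmentation: str) -> bool:
    """Check that removing all whitespace from both strings makes them equal."""
    return "".join(hashtag.split()) == "".join(segmentation.split())


# Assumption: "special" means not an ASCII alphanumeric and not whitespace.
# The pattern matches an alphanumeric character directly adjacent to a
# special character, in either order.
_UNSEPARATED = re.compile(r"[A-Za-z0-9][^A-Za-z0-9\s]|[^A-Za-z0-9\s][A-Za-z0-9]")


def specials_are_separated(segmentation: str) -> bool:
    """Check that no alphanumeric character touches a special character."""
    return _UNSEPARATED.search(segmentation) is None


print(differs_only_by_whitespace("BrandThunder", "Brand Thunder"))  # True
print(specials_are_separated("Brand Thunder"))                      # True
print(specials_are_separated("foo_bar"))                            # False
```

Running checks like these over a split is a quick way to confirm that a record follows the stated conventions before it is used for training or evaluation.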

Additional Information

Citation Information

@inproceedings{celebi2016segmenting,
  title={Segmenting hashtags using automatically created training data},
  author={Celebi, Arda and {\"O}zg{\"u}r, Arzucan},
  booktitle={Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)},
  pages={2981--2985},
  year={2016}
}

Contributions

This dataset was added by @ruanchaves while developing the hashformers library.