Dataset: ruanchaves/boun

Languages: en

Multilinguality: monolingual

Language creators: machine-generated

Annotation creators: expert-generated

Source datasets: original

Dataset Card for BOUN

Dataset Summary

Dev-BOUN is a development set of 500 manually segmented hashtags, and Test-BOUN is a test set of another 500 manually segmented hashtags. Both are selected from tweets about movies, TV shows, popular people, sports teams, etc.

Languages

English

Dataset Structure

Data Instances

{
    "index": 0,
    "hashtag": "tryingtosleep",
    "segmentation": "trying to sleep"
}
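
A record like the one above can be loaded with the Hugging Face datasets library. The following is a minimal sketch: the repository id ruanchaves/boun comes from the card header, and the first available split is picked without assuming its exact name.

from datasets import load_dataset

# Load the BOUN hashtag segmentation dataset from the Hugging Face Hub.
# The repository id "ruanchaves/boun" is taken from the card header above.
dataset = load_dataset("ruanchaves/boun")

# Take the first available split without assuming its exact name
# (the card describes Dev-BOUN and Test-BOUN splits).
split = next(iter(dataset.values()))
example = split[0]
print(example["hashtag"])       # "tryingtosleep"
print(example["segmentation"])  # "trying to sleep"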

Data Fields

  • index: a numerical index.
  • hashtag: the original hashtag.
  • segmentation: the gold segmentation for the hashtag.

Dataset Creation

  • All hashtag segmentation and identifier splitting datasets on this profile share the same basic fields: hashtag and segmentation, or identifier and segmentation.

  • The only difference between hashtag and segmentation, or between identifier and segmentation, is whitespace: spell checking, abbreviation expansion, and uppercase corrections are recorded in separate fields. This invariant can be checked with the sketch after this list.

  • There is always whitespace between an alphanumeric character and a sequence of special characters (such as _, :, ~).

  • Any annotations for named entity recognition or other token classification tasks are given in a spans field.
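
The whitespace-only rule above can be checked mechanically: removing all whitespace from segmentation should recover hashtag exactly. A minimal sketch, assuming the dataset has been loaded as in the earlier example:

# Verify that segmentation differs from hashtag only by whitespace.
def check_whitespace_invariant(example):
    return "".join(example["segmentation"].split()) == example["hashtag"]

# Validate every record in every split; a failure would point to a record
# that breaks the convention described in this section.
for name, split in dataset.items():
    assert all(check_whitespace_invariant(ex) for ex in split), name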

Additional Information

Citation Information

@article{celebi2018segmenting,
  title={Segmenting hashtags and analyzing their grammatical structure},
  author={Celebi, Arda and {\"O}zg{\"u}r, Arzucan},
  journal={Journal of the Association for Information Science and Technology},
  volume={69},
  number={5},
  pages={675--686},
  year={2018},
  publisher={Wiley Online Library}
}

Contributions

This dataset was added by @ruanchaves while developing the hashformers library.