Dataset: ruanchaves/boun

Languages: en

Multilinguality: monolingual

Language creators: machine-generated

Annotation creators: expert-generated

Source datasets: original

Dataset Card for BOUN

Dataset Summary

Dev-BOUN is a development set of 500 manually segmented hashtags, and Test-BOUN is a test set of another 500 manually segmented hashtags. Both are selected from tweets about movies, TV shows, popular people, sports teams, etc.

Languages

English

Dataset Structure

Data Instances

{
    "index": 0,
    "hashtag": "tryingtosleep",
    "segmentation": "trying to sleep"
}
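
A record like the one above can be loaded with the Hugging Face datasets library. The following is a minimal sketch: the repository id ruanchaves/boun comes from the card header, and the first available split is picked without assuming its exact name.

from datasets import load_dataset

# Load the BOUN hashtag segmentation dataset from the Hugging Face Hub.
# The repository id "ruanchaves/boun" is taken from the card header above.
dataset = load_dataset("ruanchaves/boun")

# Take the first available split without assuming its exact name
# (the card describes Dev-BOUN and Test-BOUN splits).
split = next(iter(dataset.values()))
example = split[0]
print(example["hashtag"])       # "tryingtosleep"
print(example["segmentation"])  # "trying to sleep"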

Data Fields

  • index: a numerical index.
  • hashtag: the original hashtag.
  • segmentation: the gold segmentation for the hashtag.

Dataset Creation

  • All hashtag segmentation and identifier splitting datasets on this profile share the same basic fields: hashtag and segmentation, or identifier and segmentation.

  • The only difference between hashtag and segmentation, or between identifier and segmentation, is whitespace: spell checking, abbreviation expansion, and uppercase corrections are recorded in separate fields. This invariant can be checked with the sketch after this list.

  • There is always whitespace between an alphanumeric character and a sequence of special characters (such as _, :, ~).

  • Any annotations for named entity recognition or other token classification tasks are given in a spans field.
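
The whitespace-only rule above can be checked mechanically: removing all whitespace from segmentation should recover hashtag exactly. A minimal sketch, assuming the dataset has been loaded as in the earlier example:

# Verify that segmentation differs from hashtag only by whitespace.
def check_whitespace_invariant(example):
    return "".join(example["segmentation"].split()) == example["hashtag"]

# Validate every record in every split; a failure would point to a record
# that breaks the convention described in this section.
for name, split in dataset.items():
    assert all(check_whitespace_invariant(ex) for ex in split), name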

Additional Information

Citation Information

@article{celebi2018segmenting,
  title={Segmenting hashtags and analyzing their grammatical structure},
  author={Celebi, Arda and {\"O}zg{\"u}r, Arzucan},
  journal={Journal of the Association for Information Science and Technology},
  volume={69},
  number={5},
  pages={675--686},
  year={2018},
  publisher={Wiley Online Library}
}

Contributions

This dataset was added by @ruanchaves while developing the hashformers library.