Dataset: ruanchaves/stan_small
Language: en
Multilinguality: monolingual
Language creators: machine-generated
Annotation creators: expert-generated
Source datasets: original
Preprint: arxiv:1501.03210
License: unknown

Manually Annotated Stanford Sentiment Analysis Dataset by Bansal et al.
English
{
  "index": 300,
  "hashtag": "microsoftfail",
  "segmentation": "microsoft fail",
  "alternatives": {
    "segmentation": [
      "Microsoft fail"
    ]
  }
}
Although segmentation has exactly the same characters as hashtag except for the spaces, the segmentations inside alternatives may have characters corrected to uppercase.
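
The relationship between these fields can be checked mechanically. The following is a minimal sketch that uses only the example record above; the helper names are hypothetical, and the assumption that every alternative differs from the hashtag only in whitespace and letter case is drawn from this one example rather than guaranteed for the whole dataset.

```python
import json

# The example record from this card, parsed into a dict.
record = json.loads("""
{
  "index": 300,
  "hashtag": "microsoftfail",
  "segmentation": "microsoft fail",
  "alternatives": {"segmentation": ["Microsoft fail"]}
}
""")


def strip_whitespace(text: str) -> str:
    """Remove every whitespace character from text."""
    return "".join(text.split())


def segmentation_matches_hashtag(rec: dict) -> bool:
    """True if segmentation equals hashtag once whitespace is removed."""
    return strip_whitespace(rec["segmentation"]) == rec["hashtag"]


def alternatives_are_case_variants(rec: dict) -> bool:
    """True if each alternative equals hashtag ignoring whitespace and case.

    Assumption: alternatives only change letter case; this is not stated
    as a guarantee for every record.
    """
    target = rec["hashtag"].lower()
    return all(
        strip_whitespace(alt).lower() == target
        for alt in rec["alternatives"]["segmentation"]
    )


print(segmentation_matches_hashtag(record))
print(alternatives_are_case_variants(record))
```

Both checks hold for the record shown here; a record that fails the second check would indicate an alternative with changes beyond capitalization.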
All hashtag segmentation and identifier splitting datasets on this profile have the same basic fields: hashtag and segmentation, or identifier and segmentation.
The only difference between hashtag and segmentation, or between identifier and segmentation, is the whitespace characters. Spell checking, abbreviation expansion, and uppercase corrections go into other fields.
There is always whitespace between an alphanumeric character and a sequence of special characters (such as _, :, ~).
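
The special-character rule can be illustrated with a short sketch. It assumes that "special" means any character that is neither an ASCII letter or digit nor whitespace; the card does not define the term precisely, and both function names are hypothetical. The second function is not a segmenter: it only inserts the whitespace that this rule requires.

```python
import re

# Assumption: a "special" character is anything that is neither
# ASCII alphanumeric nor whitespace.
_ALNUM = r"[A-Za-z0-9]"
_SPECIAL = r"[^A-Za-z0-9\s]"

# Matches an alphanumeric character directly touching a special one.
_ADJACENT = re.compile(f"{_ALNUM}{_SPECIAL}|{_SPECIAL}{_ALNUM}")


def respects_special_char_rule(segmentation: str) -> bool:
    """True if no alphanumeric character directly touches a special one."""
    return _ADJACENT.search(segmentation) is None


def separate_special_chars(text: str) -> str:
    """Insert a space at each alphanumeric/special boundary."""
    text = re.sub(f"(?<={_ALNUM})(?={_SPECIAL})", " ", text)
    text = re.sub(f"(?<={_SPECIAL})(?={_ALNUM})", " ", text)
    return text


print(separate_special_chars("foo_bar"))
print(respects_special_char_rule("foo _ bar"))
```

Because `separate_special_chars` adds only whitespace, its output still differs from the input by whitespace alone, which is consistent with the field definition above.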
If there are any annotations for named entity recognition and other token classification tasks, they are given in a spans field.
@misc{bansal2015deep,
  title={Towards Deep Semantic Analysis Of Hashtags},
  author={Piyush Bansal and Romil Bansal and Vasudeva Varma},
  year={2015},
  eprint={1501.03210},
  archivePrefix={arXiv},
  primaryClass={cs.IR}
}
This dataset was added by @ruanchaves while developing the hashformers library.