Datasets:

ruanchaves/test_stanford

Languages:

en

Multilinguality:

monolingual

Language Creators:

machine-generated

Annotations Creators:

expert-generated

Source Datasets:

original

ArXiv:

arxiv:1501.03210

Dataset Card for Test-Stanford

Dataset Summary

Manually Annotated Stanford Sentiment Analysis Dataset by Bansal et al.

Languages

English

Dataset Structure

Data Instances

{
    "index": 1467856821,
    "hashtag": "therapyfail",
    "segmentation": "therapy fail",
    "gold_position": 8,
    "rank": {
        "position": [
            1,
            2,
            3,
            4,
            5,
            6,
            7,
            8,
            9,
            10,
            11,
            12,
            13,
            14,
            15,
            16,
            17,
            18,
            19,
            20
        ],
        "candidate": [
            "therap y fail",
            "the rap y fail",
            "t her apy fail",
            "the rap yfail",
            "t he rap y fail",
            "thera py fail",
            "ther apy fail",
            "th era py fail",
            "therapy fail",
            "therapy fai l",
            "the r apy fail",
            "the rapyfa il",
            "the rapy fail",
            "t herapy fail",
            "the rapyfail",
            "therapy f ai l",
            "therapy fa il",
            "the rapyf a il",
            "therapy f ail",
            "the ra py fail"
        ]
    }
}

Data Fields

  • index: a numerical index annotated by Kodali et al.
  • hashtag: the original hashtag.
  • segmentation: the gold segmentation for the hashtag.
  • gold_position: position of the gold segmentation among the candidates inside rank. In the instance above, the value 8 is the zero-based index of "therapy fail" in rank.candidate.
  • rank: rank of each candidate selected by a baseline word segmenter (Segmentations Seeder Module).
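
As a minimal sketch of how these fields relate, the snippet below looks up the gold segmentation among the ranked candidates. It uses an abbreviated copy of the instance shown under Data Instances. The helper name gold_candidate is hypothetical, and reading gold_position as a zero-based index is an assumption inferred from that single instance, not something the dataset authors state.

```python
# Abbreviated copy of the instance shown under "Data Instances"
# (only the first ten candidates are kept to save space).
instance = {
    "hashtag": "therapyfail",
    "segmentation": "therapy fail",
    "gold_position": 8,
    "rank": {
        "position": list(range(1, 11)),
        "candidate": [
            "therap y fail", "the rap y fail", "t her apy fail",
            "the rap yfail", "t he rap y fail", "thera py fail",
            "ther apy fail", "th era py fail", "therapy fail",
            "therapy fai l",
        ],
    },
}


def gold_candidate(row):
    """Return the candidate at gold_position, read as a zero-based index.

    Assumption: zero-based indexing is inferred from the sample instance only.
    """
    return row["rank"]["candidate"][row["gold_position"]]


gold = gold_candidate(instance)
# True when the baseline segmenter's candidate at that index is the gold one.
matches = gold == instance["segmentation"]
# The 1-based rank listed for that candidate in rank.position.
listed_rank = instance["rank"]["position"][instance["gold_position"]]
print(gold, matches, listed_rank)
```

Under this reading, the gold segmentation sits at the ninth entry of the ranking, so the baseline segmenter did not place it first for this hashtag.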

Dataset Creation

  • All hashtag segmentation and identifier splitting datasets on this profile have the same basic fields: hashtag and segmentation, or identifier and segmentation.

  • The only difference between hashtag and segmentation, or between identifier and segmentation, is the whitespace characters. Spell checking, expanding abbreviations, or correcting characters to uppercase go into other fields.

  • There is always whitespace between an alphanumeric character and a sequence of any special characters (such as _, :, ~).

  • If there are any annotations for named entity recognition and other token classification tasks, they are given in a spans field.
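
The two whitespace conventions above can be checked with a short sketch like the following. Both function names are hypothetical, and treating "alphanumeric" as ASCII letters and digits is a simplifying assumption; the dataset authors may apply a broader definition.

```python
import re


def differs_only_by_whitespace(original: str, segmentation: str) -> bool:
    """True if the two strings are equal once all whitespace is removed."""
    return "".join(original.split()) == "".join(segmentation.split())


# Matches an ASCII alphanumeric character directly touching a special
# (non-alphanumeric, non-whitespace) character, in either order.
# Assumption: "alphanumeric" means ASCII letters and digits only.
_UNSEPARATED = re.compile(
    r"[A-Za-z0-9][^A-Za-z0-9\s]|[^A-Za-z0-9\s][A-Za-z0-9]"
)


def specials_are_separated(segmentation: str) -> bool:
    """True if whitespace separates every alphanumeric/special boundary."""
    return _UNSEPARATED.search(segmentation) is None


print(differs_only_by_whitespace("therapyfail", "therapy fail"))
print(specials_are_separated("get _ name"))
print(specials_are_separated("get_ name"))
```

A check like this can serve as a sanity test when loading a split: a row that fails the first function would contain an edit other than whitespace, which according to the conventions above belongs in a separate field.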

Additional Information

Citation Information

@misc{bansal2015deep,
      title={Towards Deep Semantic Analysis Of Hashtags}, 
      author={Piyush Bansal and Romil Bansal and Vasudeva Varma},
      year={2015},
      eprint={1501.03210},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

Contributions

This dataset was added by @ruanchaves while developing the hashformers library.