Dataset:

tner/tweebank_ner

Language: English

Dataset Card for "tner/tweebank_ner"

Dataset Summary

The TweeBank NER dataset, formatted as part of the TNER project.

  • Entity Types: LOC, MISC, PER, ORG

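Dataset Structure

Data Instances

An example from the train split looks as follows. Each instance pairs a list of tokens with a list of integer tag IDs of the same length.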

{
    'tokens': ['RT', '@USER2362', ':', 'Farmall', 'Heart', 'Of', 'The', 'Holidays', 'Tabletop', 'Christmas', 'Tree', 'With', 'Lights', 'And', 'Motion', 'URL1087', '#Holiday', '#Gifts'],
    'tags': [8, 8, 8, 2, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8]
}

Label ID

The label2id dictionary, which maps each tag name to the integer ID used in the tags field, is shown below.

{
    "B-LOC": 0,
    "B-MISC": 1,
    "B-ORG": 2,
    "B-PER": 3,
    "I-LOC": 4,
    "I-MISC": 5,
    "I-ORG": 6,
    "I-PER": 7,
    "O": 8
}
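
The mapping above can be inverted to turn the integer tags of an instance back into readable labels and entity spans. The following is a minimal sketch, not part of the original card: the helper names decode_tags and extract_entities are hypothetical, and the grouping logic assumes the B-/I- prefixes follow the usual IOB2 convention, which the card does not state explicitly.

```python
# Label map copied from the dataset card.
LABEL2ID = {
    "B-LOC": 0, "B-MISC": 1, "B-ORG": 2, "B-PER": 3,
    "I-LOC": 4, "I-MISC": 5, "I-ORG": 6, "I-PER": 7,
    "O": 8,
}
ID2LABEL = {idx: label for label, idx in LABEL2ID.items()}


def decode_tags(tag_ids):
    """Map integer tag IDs to their string labels (hypothetical helper)."""
    return [ID2LABEL[i] for i in tag_ids]


def extract_entities(tokens, tag_ids):
    """Group tagged tokens into (entity_type, text) spans.

    Assumes IOB2 tagging: B-X opens a span, I-X continues a span of type X.
    An I-X tag with no matching open span is ignored.
    """
    entities = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, decode_tags(tag_ids)):
        starts_new = label.startswith("B-")
        continues = label.startswith("I-") and label[2:] == current_type
        # Close the open span whenever the current label does not extend it.
        if current_tokens and not continues:
            entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if starts_new:
            current_type, current_tokens = label[2:], [token]
        elif continues:
            current_tokens.append(token)
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# The train example from the card.
tokens = ['RT', '@USER2362', ':', 'Farmall', 'Heart', 'Of', 'The', 'Holidays',
          'Tabletop', 'Christmas', 'Tree', 'With', 'Lights', 'And', 'Motion',
          'URL1087', '#Holiday', '#Gifts']
tags = [8, 8, 8, 2, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8]

print(extract_entities(tokens, tags))  # [('ORG', 'Farmall')]
```

In the example instance, the only non-O tag is ID 2 (B-ORG) on the token 'Farmall', so a single ORG span is recovered.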

Data Splits

name          train  validation  test
tweebank_ner  1639   710         1201
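
To see how the instances are distributed across splits, the counts above can be converted to fractions of the whole. This is a small illustrative sketch using only the numbers in the table; the function name split_fractions is hypothetical.

```python
# Split sizes copied from the table in the dataset card.
SPLIT_SIZES = {"train": 1639, "validation": 710, "test": 1201}


def split_fractions(sizes):
    """Return each split's share of the total number of instances."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total = sum(SPLIT_SIZES.values())
fractions = split_fractions(SPLIT_SIZES)

print(f"total instances: {total}")
for name, frac in fractions.items():
    print(f"{name:<10} {SPLIT_SIZES[name]:>5}  {frac:.1%}")
```

The test split is unusually large relative to the train split here, which is worth keeping in mind when comparing results against datasets with a more conventional ratio.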

Citation Information

@article{DBLP:journals/corr/abs-2201-07281,
  author    = {Hang Jiang and
               Yining Hua and
               Doug Beeferman and
               Deb Roy},
  title     = {Annotating the Tweebank Corpus on Named Entity Recognition and Building
               {NLP} Models for Social Media Analysis},
  journal   = {CoRR},
  volume    = {abs/2201.07281},
  year      = {2022},
  url       = {https://arxiv.org/abs/2201.07281},
  eprinttype = {arXiv},
  eprint    = {2201.07281},
  timestamp = {Fri, 21 Jan 2022 13:57:15 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2201-07281.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}