Dataset:

tner/bionlp2004

Languages:

en

Multilinguality:

monolingual

Size Categories:

10K<n<100K

License:

other

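Dataset Card for "tner/bionlp2004"

Dataset Summary

The BioNLP2004 NER dataset, formatted as part of the TNER project. The original BioNLP2004 dataset provides only training and test sets, so the validation set was created by randomly sampling instances from the training set, roughly half as many as there are test instances.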

  • Entity Types: DNA, protein, cell_type, cell_line, RNA

Dataset Structure

Data Instances

An example from train looks as follows.

{
    'tags': [0, 0, 0, 0, 3, 0, 9, 10, 0, 0, 0, 0, 0, 7, 8, 0, 3, 0, 0, 9, 10, 10, 0, 0],
    'tokens': ['In', 'the', 'presence', 'of', 'Epo', ',', 'c-myb', 'mRNA', 'declined', 'and', '20', '%', 'of', 'K562', 'cells', 'synthesized', 'Hb', 'regardless', 'of', 'antisense', 'myb', 'RNA', 'expression', '.']
}

Label ID

The label2id dictionary, which maps each BIO tag to the integer ID used in the tags field, is given below.

{
    "O": 0,
    "B-DNA": 1,
    "I-DNA": 2,
    "B-protein": 3,
    "I-protein": 4,
    "B-cell_type": 5,
    "I-cell_type": 6,
    "B-cell_line": 7,
    "I-cell_line": 8,
    "B-RNA": 9,
    "I-RNA": 10
}
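
As an illustration of how the integer tags relate to the tokens, the following is a minimal sketch that inverts the label2id dictionary and groups the BIO tags of the train example above into entity spans. The names `ID2LABEL` and `decode_entities` are hypothetical helpers introduced here for illustration; they are not part of TNER or the dataset itself.

```python
# label2id mapping copied from this dataset card.
LABEL2ID = {
    "O": 0,
    "B-DNA": 1, "I-DNA": 2,
    "B-protein": 3, "I-protein": 4,
    "B-cell_type": 5, "I-cell_type": 6,
    "B-cell_line": 7, "I-cell_line": 8,
    "B-RNA": 9, "I-RNA": 10,
}
# Inverse mapping (hypothetical helper name): integer ID -> BIO tag string.
ID2LABEL = {idx: label for label, idx in LABEL2ID.items()}

# The train example shown in "Data Instances".
example = {
    "tags": [0, 0, 0, 0, 3, 0, 9, 10, 0, 0, 0, 0, 0, 7, 8, 0, 3, 0, 0, 9, 10, 10, 0, 0],
    "tokens": ["In", "the", "presence", "of", "Epo", ",", "c-myb", "mRNA", "declined",
               "and", "20", "%", "of", "K562", "cells", "synthesized", "Hb",
               "regardless", "of", "antisense", "myb", "RNA", "expression", "."],
}


def decode_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, entity_text) pairs."""
    entities = []
    current_type, current_tokens = None, []

    def flush():
        # Store the entity collected so far, if any.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, tag_id in zip(tokens, tags):
        label = ID2LABEL[tag_id]
        if label == "O":
            flush()
            current_type, current_tokens = None, []
        elif label.startswith("B-") or label[2:] != current_type:
            # A B- tag (or an I- tag of a different type) starts a new entity.
            flush()
            current_type, current_tokens = label[2:], [token]
        else:
            # An I- tag continuing the current entity.
            current_tokens.append(token)
    flush()
    return entities


for entity_type, text in decode_entities(example["tokens"], example["tags"]):
    print(f"{entity_type}: {text}")
```

For this example the sketch yields two protein mentions (Epo, Hb), two RNA mentions (c-myb mRNA, antisense myb RNA), and one cell_line mention (K562 cells).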

Data Splits

name        train  validation  test
bionlp2004  16619  1927        3856
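
A minimal sketch of how the split sizes above could be checked after loading, assuming the Hugging Face `datasets` library is installed and the dataset is still loadable under the identifier "tner/bionlp2004" (depending on the installed version, loading may require extra options or may not be supported). The helper names `split_sizes` and `load_bionlp2004` are hypothetical and introduced only for illustration.

```python
# Split sizes copied from the "Data Splits" table of this card.
EXPECTED_SPLIT_SIZES = {"train": 16619, "validation": 1927, "test": 3856}


def split_sizes(dataset_dict):
    """Return {split_name: number_of_instances} for a dict-like set of splits."""
    return {name: len(split) for name, split in dataset_dict.items()}


def load_bionlp2004():
    """Download the dataset from the Hugging Face Hub (needs network access).

    Assumption: the `datasets` package is installed and still supports
    loading this dataset by its Hub identifier.
    """
    from datasets import load_dataset  # imported lazily so the helpers above work without it

    return load_dataset("tner/bionlp2004")


# Example usage (not run here because it downloads data):
#   sizes = split_sizes(load_bionlp2004())
#   print(sizes == EXPECTED_SPLIT_SIZES)
```

If the hosted data matches the table, the comparison in the usage comment would be expected to hold; a mismatch would suggest the card or the hosted files have changed.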

Citation Information

@inproceedings{collier-kim-2004-introduction,
    title = "Introduction to the Bio-entity Recognition Task at {JNLPBA}",
    author = "Collier, Nigel  and
      Kim, Jin-Dong",
    booktitle = "Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications ({NLPBA}/{B}io{NLP})",
    month = aug # " 28th and 29th",
    year = "2004",
    address = "Geneva, Switzerland",
    publisher = "COLING",
    url = "https://aclanthology.org/W04-1213",
    pages = "73--78",
}