"tner/bc5cdr" 数据集卡片

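Dataset Summary

The BioCreative V CDR NER dataset, formatted for use in the TNER project. The original dataset consists of long documents that are too long to feed into a language model, so we split them into sentences to reduce their size.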

Entity Types: Chemical, Disease

Dataset Structure

Data Instances

An example from the train split looks as follows.

{
    'tags': [2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0],
    'tokens': ['Fasciculations', 'in', 'six', 'areas', 'of', 'the', 'body', 'were', 'scored', 'from', '0', 'to', '3', 'and', 'summated', 'as', 'a', 'total', 'fasciculation', 'score', '.']
}
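
Each instance pairs a list of tokens with a list of integer tags of the same length. The sketch below checks that property on the example above. The helper names `check_alignment` and `load_bc5cdr` are hypothetical and not part of TNER. The loader assumes the Hugging Face `datasets` library is installed, that network access is available, and that the installed version can load this repository; it is defined but not called here.

```python
# The example instance shown above (tokens and tags from the train split).
example = {
    "tags": [2, 0, 0, 0, 0, 0, 0, 0, 0, 0,
             0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0],
    "tokens": ["Fasciculations", "in", "six", "areas", "of", "the", "body",
               "were", "scored", "from", "0", "to", "3", "and", "summated",
               "as", "a", "total", "fasciculation", "score", "."],
}


def check_alignment(instance):
    """Return True if the instance has exactly one tag per token."""
    return len(instance["tokens"]) == len(instance["tags"])


def load_bc5cdr(split="train"):
    """Load one split from the Hugging Face Hub (hypothetical helper).

    Assumes `datasets` is installed and the Hub is reachable.
    """
    # Imported lazily so the offline check below runs without `datasets`.
    from datasets import load_dataset
    return load_dataset("tner/bc5cdr", split=split)


print(check_alignment(example))
```

If the loader works in your environment, something like `load_bc5cdr("train")[0]` should return a dictionary with the same two keys.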

Label ID

The label-to-ID dictionary is as follows.

{
    "O": 0,
    "B-Chemical": 1,
    "B-Disease": 2,
    "I-Disease": 3,
    "I-Chemical": 4
}
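
To read the integer tags, the mapping above can be inverted and the B-/I- prefixes used to group tokens into entity spans. This is a minimal sketch that assumes the tags follow the usual IOB2 convention (B- opens an entity, I- continues it), which the label names suggest but the card does not state. The function `extract_entities` and the second input are hypothetical, for illustration only.

```python
label2id = {"O": 0, "B-Chemical": 1, "B-Disease": 2, "I-Disease": 3, "I-Chemical": 4}
id2label = {i: label for label, i in label2id.items()}


def extract_entities(tokens, tags):
    """Group IOB2-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        label = id2label[tag]
        if label.startswith("B-"):
            # A new entity starts; close any entity that is still open.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            # Continuation of the open entity of the same type.
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open entity: close the span.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# The train example from this card.
tokens = ["Fasciculations", "in", "six", "areas", "of", "the", "body",
          "were", "scored", "from", "0", "to", "3", "and", "summated",
          "as", "a", "total", "fasciculation", "score", "."]
tags = [2, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0]
print(extract_entities(tokens, tags))
# [('Disease', 'Fasciculations'), ('Disease', 'fasciculation')]

# A made-up input (not from the dataset) showing a multi-token entity.
print(extract_entities(["renal", "failure", "after", "cisplatin"], [2, 3, 0, 1]))
# [('Disease', 'renal failure'), ('Chemical', 'cisplatin')]
```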

Data Splits

name	train	validation	test
bc5cdr	5228	5330	5865
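
As a quick arithmetic check on the table above, the following sketch computes the total number of instances and each split's share of it. The counts are copied from the table; the variable names are illustrative.

```python
# Instance counts per split, copied from the table above.
split_sizes = {"train": 5228, "validation": 5330, "test": 5865}

total = sum(split_sizes.values())
shares = {name: size / total for name, size in split_sizes.items()}

for name, size in split_sizes.items():
    print(f"{name}: {size} instances ({shares[name]:.1%} of total)")
print(f"total: {total}")
```

The three splits are of similar size, with the test split being the largest.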

Citation Information

@article{wei2016assessing,
  title={Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task},
  author={Wei, Chih-Hsuan and Peng, Yifan and Leaman, Robert and Davis, Allan Peter and Mattingly, Carolyn J and Li, Jiao and Wiegers, Thomas C and Lu, Zhiyong},
  journal={Database},
  volume={2016},
  year={2016},
  publisher={Oxford Academic}
}

Author:

tner

Dataset size:

4.15 MB