Dataset:
tner/bc5cdr
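The BioCreative V CDR NER dataset, formatted for use in the TNER project. The original dataset consists of long documents that are too long to feed into a language model, so we split them into sentences to reduce their size.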
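To illustrate what sentence-level splitting of a token-tagged document can look like, here is a minimal sketch. It is not the TNER preprocessing code: the function name `split_into_sentences`, the choice of terminator tokens, and the toy input are all assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def split_into_sentences(
    tokens: Sequence[str],
    tags: Sequence[int],
    terminators: Tuple[str, ...] = (".", "?", "!"),
) -> List[Tuple[List[str], List[int]]]:
    """Split one tokenized document into sentence-level (tokens, tags) pairs.

    Hypothetical helper: a sentence is assumed to end at any token that is
    listed in `terminators`. Real preprocessing may use a different rule.
    """
    sentences = []
    cur_tokens, cur_tags = [], []
    for token, tag in zip(tokens, tags):
        cur_tokens.append(token)
        cur_tags.append(tag)
        if token in terminators:
            # Close the current sentence and start a new one.
            sentences.append((cur_tokens, cur_tags))
            cur_tokens, cur_tags = [], []
    if cur_tokens:
        # Keep trailing tokens that were not followed by a terminator.
        sentences.append((cur_tokens, cur_tags))
    return sentences


# Toy input, made up for this sketch (1 = B-Chemical, 2 = B-Disease, 0 = O).
doc_tokens = ["Aspirin", "helps", ".", "Headache", "persists"]
doc_tags = [1, 0, 0, 2, 0]
sentences = split_into_sentences(doc_tokens, doc_tags)
```

Each returned pair keeps tokens and tags aligned, so every sentence can be used as a separate, shorter training example.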
An example from `train` looks as follows.
{ 'tags': [2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0], 'tokens': ['Fasciculations', 'in', 'six', 'areas', 'of', 'the', 'body', 'were', 'scored', 'from', '0', 'to', '3', 'and', 'summated', 'as', 'a', 'total', 'fasciculation', 'score', '.'] }
The label-to-ID dictionary is shown below.
{ "O": 0, "B-Chemical": 1, "B-Disease": 2, "I-Disease": 3, "I-Chemical": 4 }
name | train | validation | test |
---|---|---|---|
bc5cdr | 5228 | 5330 | 5865 |
@article{wei2016assessing, title={Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task}, author={Wei, Chih-Hsuan and Peng, Yifan and Leaman, Robert and Davis, Allan Peter and Mattingly, Carolyn J and Li, Jiao and Wiegers, Thomas C and Lu, Zhiyong}, journal={Database}, volume={2016}, year={2016}, publisher={Oxford Academic} }