Datasets:

qanastek/HoC

Languages:

en

Size:

1K<n<10K

Language creators:

found

Source datasets:

original

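HoC: Hallmarks of Cancer Corpus

Dataset Summary

The Hallmarks of Cancer Corpus for text classification

The Hallmarks of Cancer (HoC) corpus consists of 1852 PubMed publication abstracts that were manually annotated by experts according to a taxonomy. The taxonomy comprises 37 classes arranged in a hierarchy. Each sentence in the corpus can be assigned one or more class labels. The labels are found under the "labels" directory, while the tokenized text is found under the "text" directory. Each file is named after the corresponding PubMed ID (PMID).

In addition to the HoC corpus, we also provide a dataset that classifies all of PubMed according to the HoC taxonomy (the Cancer Hallmarks Analytics Tool, CHAT).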

Supported Tasks and Leaderboards

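This dataset can be used to train text classification models. Because a sentence may carry several hallmark labels at once, the task is best framed as multi-label rather than single-label multi-class classification.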

Languages

The corpus consists only of PubMed articles in English:

  • English - United States (en-US)

Load the dataset with HuggingFace

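from datasets import load_dataset

# Download the corpus from the Hugging Face Hub (or reuse the local cache)
dataset = load_dataset("qanastek/HoC")

# Select the validation split and inspect its first example
validation = dataset["validation"]
print("First element of the validation set : ", validation[0])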

Dataset Structure

Data Instances

{
  "document_id": "12634122_5",
  "text": "Genes that were overexpressed in OM3 included oncogenes , cell cycle regulators , and those involved in signal transduction , whereas genes for DNA repair enzymes and inhibitors of transformation and metastasis were suppressed .",
  "label": [9, 5, 0, 6]
}
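
The `document_id` in the instance above looks like a PubMed ID followed by an underscore and a number. The sketch below splits it on that assumption, which can help when grouping sentences by abstract. The source does not document the identifier format, so the `<PMID>_<sentence index>` reading and the helper name `parse_document_id` are hypothetical.

```python
from typing import NamedTuple


class SentenceId(NamedTuple):
    pmid: str            # assumed: PubMed ID of the abstract
    sentence_index: int  # assumed: position of the sentence in the abstract


def parse_document_id(document_id: str) -> SentenceId:
    """Split an id such as '12634122_5' at its last underscore.

    Assumption: ids follow '<PMID>_<sentence index>'. The dataset card
    does not state this explicitly.
    """
    pmid, _, index = document_id.rpartition("_")
    if not pmid or not index.isdigit():
        raise ValueError(f"unexpected document_id format: {document_id!r}")
    return SentenceId(pmid, int(index))


parsed = parse_document_id("12634122_5")
print(parsed.pmid, parsed.sentence_index)
```

Grouping by the assumed PMID is one way to keep all sentences of an abstract together when building custom splits.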

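Data Fields

document_id: Unique identifier of the document.

text: Raw text of a sentence taken from a PubMed abstract.

label: A list of integer class indices, each referring to one of the 10 currently known hallmarks of cancer, listed below together with their search terms.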

Hallmark | Search terms
1. Sustaining proliferative signaling (PS) | Proliferation Receptor Cancer; 'Growth factor' Cancer; 'Cell cycle' Cancer
2. Evading growth suppressors (GS) | 'Cell cycle' Cancer; 'Contact inhibition'
3. Resisting cell death (CD) | Apoptosis Cancer; Necrosis Cancer; Autophagy Cancer
4. Enabling replicative immortality (RI) | Senescence Cancer; Immortalization Cancer
5. Inducing angiogenesis (A) | Angiogenesis Cancer; 'Angiogenic factor'
6. Activating invasion & metastasis (IM) | Metastasis Invasion Cancer
7. Genome instability & mutation (GI) | Mutation Cancer; 'DNA repair' Cancer; Adducts Cancer; 'Strand breaks' Cancer; 'DNA damage' Cancer
8. Tumor-promoting inflammation (TPI) | Inflammation Cancer; 'Oxidative stress' Cancer; Inflammation 'Immune response' Cancer
9. Deregulating cellular energetics (CE) | Glycolysis Cancer; 'Warburg effect' Cancer
10. Avoiding immune destruction (ID) | 'Immune system' Cancer; Immunosuppression Cancer
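
Since the `label` field holds a list of integers, a common preprocessing step for multi-label training is to turn it into a fixed-length multi-hot vector. The following is a minimal sketch, not the authors' pipeline. `NUM_CLASSES = 10` is an assumption based on the ten hallmarks listed above, and the source does not say which integer corresponds to which hallmark, so the 1 to 10 numbering in the table should not be read as the label index.

```python
from typing import List

# Assumption: one class per hallmark listed above. Check the loaded
# dataset's features for the actual number of classes before relying on it.
NUM_CLASSES = 10


def to_multi_hot(labels: List[int], num_classes: int = NUM_CLASSES) -> List[float]:
    """Convert a list of class indices into a multi-hot vector."""
    vector = [0.0] * num_classes
    for index in labels:
        if not 0 <= index < num_classes:
            raise ValueError(f"label {index} outside [0, {num_classes})")
        vector[index] = 1.0
    return vector


# Label list taken from the sample instance shown in this card.
example_labels = [9, 5, 0, 6]
multi_hot = to_multi_hot(example_labels)
print(multi_hot)
```

A vector of this shape can be paired with a per-class sigmoid output and a binary cross-entropy loss, which is one usual choice for multi-label problems.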

Data Splits

Distribution of the data across the 10 hallmarks of cancer:

Hallmark | No. abstracts | No. sentences
1. PS | 462 | 993
2. GS | 242 | 468
3. CD | 430 | 883
4. RI | 115 | 295
5. A | 143 | 357
6. IM | 291 | 667
7. GI | 333 | 771
8. TPI | 194 | 437
9. CE | 105 | 213
10. ID | 108 | 226
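
The distribution above is uneven: PS has more than four times as many annotated sentences as CE. The sketch below uses the sentence counts from the table to compute each hallmark's share and an inverse-frequency weight. The weighting scheme is only an illustration of how the imbalance might be handled, not something the corpus authors prescribe. Because a sentence can carry several labels, the total counts label assignments rather than distinct sentences.

```python
from typing import Dict

# Sentence counts per hallmark, copied from the table above.
SENTENCES_PER_HALLMARK: Dict[str, int] = {
    "PS": 993, "GS": 468, "CD": 883, "RI": 295, "A": 357,
    "IM": 667, "GI": 771, "TPI": 437, "CE": 213, "ID": 226,
}


def inverse_frequency_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Weight each class by largest_count / class_count (most frequent = 1.0)."""
    largest = max(counts.values())
    return {name: largest / count for name, count in counts.items()}


total_assignments = sum(SENTENCES_PER_HALLMARK.values())
weights = inverse_frequency_weights(SENTENCES_PER_HALLMARK)

# Print hallmarks from most to least frequent with share and weight.
for name, count in sorted(SENTENCES_PER_HALLMARK.items(),
                          key=lambda item: item[1], reverse=True):
    share = count / total_assignments
    print(f"{name:>3}  {count:4d}  {share:6.1%}  weight={weights[name]:.2f}")
```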

Dataset Creation

Source Data

Who are the source language producers?

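The corpus was produced and uploaded by Baker Simon, Silins Ilona, Guo Yufan, Ali Imran, Hogberg Johan, Stenius Ulla, and Korhonen Anna.

Personal and Sensitive Information

The corpus does not contain personal or sensitive information.

Additional Information

Dataset Curators

HoC: Baker Simon, Silins Ilona, Guo Yufan, Ali Imran, Hogberg Johan, Stenius Ulla, and Korhonen Anna

Hugging Face: Labrak Yanis (not affiliated with the original corpus)

Licensing Information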

GNU General Public License v3.0
Permissions
- Commercial use
- Modification
- Distribution
- Patent use
- Private use
Limitations
- Liability
- Warranty
Conditions
- License and copyright notice
- State changes
- Disclose source
- Same license

Citation Information

We would very much appreciate it if you cite our publications:

Automatic semantic classification of scientific literature according to the hallmarks of cancer

@article{baker2015automatic,
  title={Automatic semantic classification of scientific literature according to the hallmarks of cancer},
  author={Baker, Simon and Silins, Ilona and Guo, Yufan and Ali, Imran and H{\"o}gberg, Johan and Stenius, Ulla and Korhonen, Anna},
  journal={Bioinformatics},
  volume={32},
  number={3},
  pages={432--440},
  year={2015},
  publisher={Oxford University Press}
}

Cancer Hallmarks Analytics Tool (CHAT): a text mining approach to organize and evaluate scientific literature on cancer

@article{baker2017cancer,
  title={Cancer Hallmarks Analytics Tool (CHAT): a text mining approach to organize and evaluate scientific literature on cancer},
  author={Baker, Simon and Ali, Imran and Silins, Ilona and Pyysalo, Sampo and Guo, Yufan and H{\"o}gberg, Johan and Stenius, Ulla and Korhonen, Anna},
  journal={Bioinformatics},
  volume={33},
  number={24},
  pages={3973--3981},
  year={2017},
  publisher={Oxford University Press}
}

Cancer hallmark text classification using convolutional neural networks

@article{baker2016cancer,
  title={Cancer hallmark text classification using convolutional neural networks},
  author={Baker, Simon and Korhonen, Anna-Leena and Pyysalo, Sampo},
  year={2016}
}

Initializing neural networks for hierarchical multi-label text classification

@article{baker2017initializing,
  title={Initializing neural networks for hierarchical multi-label text classification},
  author={Baker, Simon and Korhonen, Anna},
  journal={BioNLP 2017},
  pages={307--315},
  year={2017}
}