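HoC: Hallmarks of Cancer Corpus for Text Classification
The Hallmarks of Cancer (HoC) corpus consists of 1852 PubMed publication abstracts manually annotated by experts according to a taxonomy. The taxonomy contains 37 classes arranged in a hierarchy. Each sentence in the corpus may be assigned one or more class labels. The labels are found under the "labels" directory, and the tokenized text is found under the "text" directory. Each file is named after the corresponding PubMed ID (PMID).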
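The directory layout described above can be read with a few lines of Python. The following is a minimal sketch, not the authors' loader: it assumes that both directories contain a file named `<PMID>.txt` and that each file holds one sentence (or one label line) per row. The function names `list_pmids` and `read_hoc_abstract` are hypothetical.

```python
from pathlib import Path


def list_pmids(corpus_root):
    """Return the PMIDs found in the "text" directory, sorted.

    Assumption: files are named "<PMID>.txt".
    """
    return sorted(p.stem for p in (Path(corpus_root) / "text").glob("*.txt"))


def read_hoc_abstract(corpus_root, pmid):
    """Pair each sentence of one abstract with its label line.

    Assumption: "text/<PMID>.txt" and "labels/<PMID>.txt" are aligned
    row by row, one sentence and one label line per row.
    """
    root = Path(corpus_root)
    sentences = (root / "text" / f"{pmid}.txt").read_text(encoding="utf-8").splitlines()
    labels = (root / "labels" / f"{pmid}.txt").read_text(encoding="utf-8").splitlines()
    if len(sentences) != len(labels):
        raise ValueError(f"PMID {pmid}: {len(sentences)} sentences but {len(labels)} label lines")
    return list(zip(sentences, labels))
```

If the real files use a different extension or label format, only the two path expressions and the line splitting need to change.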
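In addition to the HoC corpus, we also have a dataset, the Cancer Hallmarks Analytics Tool (CHAT), that classifies all of PubMed according to the HoC taxonomy.
This dataset can be used to train multi-label classification models, since a sentence may carry more than one hallmark label.
The corpus consists only of English-language PubMed articles. It can be loaded with the Hugging Face `datasets` library: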
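    from datasets import load_dataset

    # Download the corpus from the Hugging Face Hub
    dataset = load_dataset("qanastek/HoC")

    # Select the validation split and print its first record
    validation = dataset["validation"]
    print("First element of the validation set : ", validation[0])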
    {
        "document_id": "12634122_5",
        "text": "Genes that were overexpressed in OM3 included oncogenes , cell cycle regulators , and those involved in signal transduction , whereas genes for DNA repair enzymes and inhibitors of transformation and metastasis were suppressed .",
        "label": [9, 5, 0, 6]
    }
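`document_id`: Unique identifier of the document.
`text`: Raw text of a sentence from the PubMed abstract.
`label`: List of class indices, each corresponding to one of the 10 currently known hallmarks of cancer.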
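Because `label` is a list of indices, a common preprocessing step for multi-label training is to turn it into a fixed-length 0/1 vector. The sketch below also splits `document_id`, on the assumption that its form is `<PMID>_<sentence index>`, which the sample record suggests but this card does not state. `NUM_CLASSES`, `split_document_id`, and `to_multi_hot` are hypothetical names, and the value 10 is an assumption; the actual number of classes should be read from the loaded dataset's `features`.

```python
NUM_CLASSES = 10  # assumption: check the loaded dataset's features for the real value


def split_document_id(document_id):
    """Split an id such as "12634122_5" into (pmid, sentence_index).

    Assumption: the part after the last underscore is a sentence index.
    """
    pmid, _, index = document_id.rpartition("_")
    return pmid, int(index)


def to_multi_hot(labels, num_classes=NUM_CLASSES):
    """Convert a list of class indices into a 0/1 vector of length num_classes."""
    vector = [0] * num_classes
    for class_index in labels:
        vector[class_index] = 1
    return vector


# Fields taken from the sample record shown in this card ("text" left out for brevity)
record = {"document_id": "12634122_5", "label": [9, 5, 0, 6]}

pmid, sentence_index = split_document_id(record["document_id"])
target = to_multi_hot(record["label"])
print(pmid, sentence_index, target)
```

A vector of this shape is what most multi-label loss functions expect as a target, one position per class.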
Hallmark | Search terms |
---|---|
1. Sustaining proliferative signaling (PS) | Proliferation Receptor Cancer; 'Growth factor' Cancer; 'Cell cycle' Cancer |
2. Evading growth suppressors (GS) | 'Cell cycle' Cancer; 'Contact inhibition' |
3. Resisting cell death (CD) | Apoptosis Cancer; Necrosis Cancer; Autophagy Cancer |
4. Enabling replicative immortality (RI) | Senescence Cancer; Immortalization Cancer |
5. Inducing angiogenesis (A) | Angiogenesis Cancer; 'Angiogenic factor' |
6. Activating invasion & metastasis (IM) | Metastasis Invasion Cancer |
7. Genome instability & mutation (GI) | Mutation Cancer; 'DNA repair' Cancer; Adducts Cancer; 'Strand breaks' Cancer; 'DNA damage' Cancer |
8. Tumor-promoting inflammation (TPI) | Inflammation Cancer; 'Oxidative stress' Cancer; Inflammation 'Immune response' Cancer |
9. Deregulating cellular energetics (CE) | Glycolysis Cancer; 'Warburg effect' Cancer |
10. Avoiding immune destruction (ID) | 'Immune system' Cancer; Immunosuppression Cancer |
Distribution of the data across the 10 hallmarks of cancer:
Hallmark | No. abstracts | No. sentences |
---|---|---|
1. PS | 462 | 993 |
2. GS | 242 | 468 |
3. CD | 430 | 883 |
4. RI | 115 | 295 |
5. A | 143 | 357 |
6. IM | 291 | 667 |
7. GI | 333 | 771 |
8. TPI | 194 | 437 |
9. CE | 105 | 213 |
10. ID | 108 | 226 |
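The counts in this table show a clear class imbalance, which is worth quantifying before training. The sketch below copies the figures from the table and derives two simple statistics per hallmark. Since a sentence can carry several labels, the total computed here counts label assignments, not distinct sentences. The names `HALLMARK_COUNTS` and `summarize` are hypothetical.

```python
# (abbreviation, number of abstracts, number of sentences), copied from the table
HALLMARK_COUNTS = [
    ("PS", 462, 993),
    ("GS", 242, 468),
    ("CD", 430, 883),
    ("RI", 115, 295),
    ("A", 143, 357),
    ("IM", 291, 667),
    ("GI", 333, 771),
    ("TPI", 194, 437),
    ("CE", 105, 213),
    ("ID", 108, 226),
]


def summarize(counts):
    """Return per-hallmark sentence density and share of all label assignments."""
    total = sum(sentences for _, _, sentences in counts)
    return [
        {
            "hallmark": name,
            "sentences_per_abstract": round(sentences / abstracts, 2),
            "share_of_labels": round(sentences / total, 3),
        }
        for name, abstracts, sentences in counts
    ]


summary = summarize(HALLMARK_COUNTS)
for row in summary:
    print(row)
```

The most frequent hallmark (PS) has more than four times as many labeled sentences as the least frequent one (CE), so per-class metrics or class weighting may be worth considering.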
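The corpus was produced and uploaded by Baker Simon, Silins Ilona, Guo Yufan, Ali Imran, Hogberg Johan, Stenius Ulla, and Korhonen Anna.
The corpus does not contain personal or sensitive information.
HoC: Baker Simon, Silins Ilona, Guo Yufan, Ali Imran, Hogberg Johan, Stenius Ulla, and Korhonen Anna
Hugging Face: Labrak Yanis (not affiliated with the original corpus)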
GNU General Public License v3.0
Permissions: Commercial use, Modification, Distribution, Patent use, Private use
Limitations: Liability, Warranty
Conditions: License and copyright notice, State changes, Disclose source, Same license
We would very much appreciate it if you cite our publications:
Automatic semantic classification of scientific literature according to the hallmarks of cancer
    @article{baker2015automatic,
      title={Automatic semantic classification of scientific literature according to the hallmarks of cancer},
      author={Baker, Simon and Silins, Ilona and Guo, Yufan and Ali, Imran and H{\"o}gberg, Johan and Stenius, Ulla and Korhonen, Anna},
      journal={Bioinformatics},
      volume={32},
      number={3},
      pages={432--440},
      year={2015},
      publisher={Oxford University Press}
    }
Cancer Hallmarks Analytics Tool (CHAT): a text mining approach to organize and evaluate scientific literature on cancer

    @article{baker2017cancer,
      title={Cancer Hallmarks Analytics Tool (CHAT): a text mining approach to organize and evaluate scientific literature on cancer},
      author={Baker, Simon and Ali, Imran and Silins, Ilona and Pyysalo, Sampo and Guo, Yufan and H{\"o}gberg, Johan and Stenius, Ulla and Korhonen, Anna},
      journal={Bioinformatics},
      volume={33},
      number={24},
      pages={3973--3981},
      year={2017},
      publisher={Oxford University Press}
    }
Cancer hallmark text classification using convolutional neural networks
    @article{baker2016cancer,
      title={Cancer hallmark text classification using convolutional neural networks},
      author={Baker, Simon and Korhonen, Anna-Leena and Pyysalo, Sampo},
      year={2016}
    }
Initializing neural networks for hierarchical multi-label text classification
    @article{baker2017initializing,
      title={Initializing neural networks for hierarchical multi-label text classification},
      author={Baker, Simon and Korhonen, Anna},
      journal={BioNLP 2017},
      pages={307--315},
      year={2017}
    }