Dataset:
bigbio/nlm_gene
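NLM-Gene consists of 550 PubMed articles from 156 journals and contains more than 15,000 unique gene names, corresponding to more than 5,000 gene identifiers (NCBI Gene taxonomy). The corpus contains gene annotation data from 28 organisms. The annotated articles contain, on average, 29 gene names and 10 gene identifiers per article. These characteristics make the article set an important benchmark dataset for testing the accuracy of gene recognition algorithms on both multi-species and ambiguous data. The NLM-Gene corpus will be invaluable for advancing text-mining techniques for gene identification tasks in biomedical text.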
@article{islamaj2021nlm,
  title     = {NLM-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi-species gene recognition},
  author    = {Islamaj, Rezarta and Wei, Chih-Hsuan and Cissel, David and Miliaras, Nicholas and Printseva, Olga and Rodionov, Oleg and Sekiya, Keiko and Ward, Janice and Lu, Zhiyong},
  year      = 2021,
  journal   = {Journal of Biomedical Informatics},
  publisher = {Elsevier},
  volume    = 118,
  pages     = 103779
}