ProGene Dataset Card

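The Protein/Gene corpus was developed at the JULIE Lab in Jena under the supervision of Prof. Udo Hahn. The executing scientist was Dr. Joachim Wermter. The main annotator was Dr. Rico Pusch, an expert in biology. The corpus was developed in the context of the StemNet project (http://www.stemnet.de/).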

Citation Information

@inproceedings{faessler-etal-2020-progene,
    title = "{P}ro{G}ene - A Large-scale, High-Quality Protein-Gene Annotated Benchmark Corpus",
    author = "Faessler, Erik  and
      Modersohn, Luise  and
      Lohr, Christina  and
      Hahn, Udo",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.564",
    pages = "4585--4596",
    abstract = "Genes and proteins constitute the fundamental entities of molecular genetics. We here introduce ProGene (formerly called FSU-PRGE), a corpus that reflects our efforts to cope with this important class of named entities within the framework of a long-lasting large-scale annotation campaign at the Jena University Language {\&} Information Engineering (JULIE) Lab. We assembled the entire corpus from 11 subcorpora covering various biological domains to achieve an overall subdomain-independent corpus. It consists of 3,308 MEDLINE abstracts with over 36k sentences and more than 960k tokens annotated with nearly 60k named entity mentions. Two annotators strove for carefully assigning entity mentions to classes of genes/proteins as well as families/groups, complexes, variants and enumerations of those where genes and proteins are represented by a single class. The main purpose of the corpus is to provide a large body of consistent and reliable annotations for supervised training and evaluation of machine learning algorithms in this relevant domain. Furthermore, we provide an evaluation of two state-of-the-art baseline systems {---} BioBert and flair {---} on the ProGene corpus. We make the evaluation datasets and the trained models available to encourage comparable evaluations of new methods in the future.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}

Author:

bigbio

Dataset size:

20.72 MB