Datasets:

EMBO/biolang

Languages:

en

Multilinguality:

monolingual

Language creators:

expert-generated

Annotations creators:

machine-generated

License:

cc-by-4.0

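Dataset Card for BioLang

Dataset Summary

The BioLang dataset is built from abstracts in the open access section of Europe PubMed Central and is intended for training language models in the domain of biology. It can be used for random masked language modeling, or for masked language modeling restricted to specific parts of speech. More details on how the dataset is generated and used are available at https://github.com/source-data/soda-roberta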

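Supported Tasks and Leaderboards

  • MLM: masked language modeling
  • DET: part-of-speech masked language modeling, targeting determiners tagged "DET"
  • SMALL: part-of-speech masked language modeling, targeting "small" words tagged "DET", "CCONJ", "SCONJ", "ADP", or "PRON"
  • VERB: part-of-speech masked language modeling, targeting verbs tagged "VERB"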
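
The relationship between the three part-of-speech tasks and their tag sets can be sketched as follows. This is a minimal illustration in plain Python, not the code used to build the dataset: the names `TASK_TAGS` and `build_tag_mask` are hypothetical, and the toy tag sequence is made up for the example.

```python
# Assumed mapping from task name to the POS tags it targets,
# copied from the task list above.
TASK_TAGS = {
    "DET": {"DET"},
    "SMALL": {"DET", "CCONJ", "SCONJ", "ADP", "PRON"},
    "VERB": {"VERB"},
}


def build_tag_mask(pos_tags, task):
    """Return 1 for tokens whose POS tag is targeted by `task`, else 0."""
    targets = TASK_TAGS[task]
    return [1 if tag in targets else 0 for tag in pos_tags]


# Toy token-level tags (hypothetical, not taken from the dataset).
pos_tags = ["X", "PRON", "VERB", "DET", "ADJ", "NOUN", "ADP", "NOUN", "PUNCT", "X"]

det_mask = build_tag_mask(pos_tags, "DET")
small_mask = build_tag_mask(pos_tags, "SMALL")
verb_mask = build_tag_mask(pos_tags, "VERB")

print("DET:  ", det_mask)
print("SMALL:", small_mask)
print("VERB: ", verb_mask)
```

Because "DET" belongs to the SMALL tag set, every position selected by the DET task is also selected by the SMALL task under this reading.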

Languages

English

Dataset Structure

Data Instances

{
    "input_ids":[
        0, 2444, 6997, 46162, 7744, 35, 20632, 20862, 3457, 36, 500, 23858, 29, 43, 32, 3919, 716, 15, 49, 4476, 4, 1398, 6, 52, 1118, 5, 20862, 819, 9, 430, 23305, 248, 23858, 29, 4, 256, 40086, 104, 35, 1927, 1069, 459, 1484, 58, 4776, 13, 23305, 634, 16706, 493, 2529, 8954, 14475, 73, 34263, 6, 4213, 718, 833, 12, 24291, 4473, 22500, 14475, 73, 510, 705, 73, 34263, 6, 5143, 4313, 2529, 8954, 14475, 73, 34263, 6, 8, 5143, 4313, 2529, 8954, 14475, 248, 23858, 29, 23, 4448, 225, 4722, 2392, 11, 9341, 261, 4, 49043, 35, 96, 746, 6, 5962, 9, 38415, 4776, 408, 36, 3897, 4, 398, 8871, 56, 23305, 4, 20, 15608, 21, 8061, 6164, 207, 13, 70, 248, 23858, 29, 6, 150, 5, 42561, 21, 8061, 5663, 207, 13, 80, 3457, 4, 509, 1296, 5129, 21567, 3457, 36, 398, 23528, 8748, 22065, 11654, 35, 7253, 15, 49, 4476, 6, 70, 3457, 4682, 65, 189, 28, 5131, 13, 23305, 9726, 4, 2
    ], 
    "label_ids": [
        "X", "NOUN", "NOUN", "NOUN", "NOUN", "PUNCT", "ADJ", "ADJ", "NOUN", "PUNCT", "PROPN", "PROPN", "PROPN", "PUNCT", "AUX", "VERB", "VERB", "ADP", "DET", "NOUN", "PUNCT", "ADV", "PUNCT", "PRON", "VERB", "DET", "ADJ", "NOUN", "ADP", "ADJ", "NOUN", "NOUN", "NOUN", "NOUN", "PUNCT", "ADJ", "ADJ", "ADJ", "PUNCT", "NOUN", "NOUN", "NOUN", "NOUN", "AUX", "VERB", "ADP", "NOUN", "VERB", "PROPN", "PROPN", "PROPN", "PROPN", "PROPN", "SYM", "PROPN", "PUNCT", "PROPN", "PROPN", "PROPN", "PUNCT", "PROPN", "PROPN", "PROPN", "PROPN", "SYM", "PROPN", "PROPN", "SYM", "PROPN", "PUNCT", "PROPN", "PROPN", "PROPN", "PROPN", "PROPN", "SYM", "PROPN", "PUNCT", "CCONJ", "ADJ", "PROPN", "PROPN", "PROPN", "PROPN", "NOUN", "NOUN", "NOUN", "ADP", "PROPN", "PROPN", "PROPN", "PROPN", "ADP", "PROPN", "PROPN", "PUNCT", "PROPN", "PUNCT", "ADP", "NOUN", "PUNCT", "NUM", "ADP", "NUM", "VERB", "NOUN", "PUNCT", "NUM", "NUM", "NUM", "NOUN", "AUX", "NOUN", "PUNCT", "DET", "NOUN", "AUX", "X", "NUM", "NOUN", "ADP", "DET", "NOUN", "NOUN", "NOUN", "PUNCT", "SCONJ", "DET", "NOUN", "AUX", "X", "NUM", "NOUN", "ADP", "NUM", "NOUN", "PUNCT", "NUM", "NOUN", "VERB", "ADJ", "NOUN", "PUNCT", "NUM", "NOUN", "NOUN", "NOUN", "NOUN", "PUNCT", "VERB", "ADP", "DET", "NOUN", "PUNCT", "DET", "NOUN", "SCONJ", "PRON", "VERB", "AUX", "VERB", "ADP", "NOUN", "NOUN", "PUNCT", "X"
    ], 
    "special_tokens_mask": [
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1
    ]
}

Data Fields

MLM:

  • input_ids: a list of int32 features.
  • special_tokens_mask: a list of int8 features.

DET, VERB, SMALL:

  • input_ids: a list of int32 features.
  • tag_mask: a list of int8 features.
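
As an illustration of how `special_tokens_mask` might be used in the MLM configuration, the sketch below masks a random subset of non-special tokens. It is a simplified assumption, not the training code from the soda-roberta repository: every selected token is replaced by the mask token (no random or unchanged replacements), `MASK_TOKEN_ID` is assumed to be the `<mask>` id of the roberta-base vocabulary, `IGNORE_INDEX` is an assumed label value for positions excluded from the loss, and `mask_tokens` is a hypothetical helper.

```python
import random

MASK_TOKEN_ID = 50264  # assumed: id of <mask> in the roberta-base vocabulary
IGNORE_INDEX = -100    # assumed: label value for positions ignored by the loss


def mask_tokens(input_ids, special_tokens_mask, mlm_probability=0.15, seed=None):
    """Randomly mask non-special tokens; return (masked_ids, labels)."""
    rng = random.Random(seed)
    masked_ids, labels = [], []
    for token_id, is_special in zip(input_ids, special_tokens_mask):
        if not is_special and rng.random() < mlm_probability:
            masked_ids.append(MASK_TOKEN_ID)  # hide the token from the model
            labels.append(token_id)           # the model must predict the original id
        else:
            masked_ids.append(token_id)       # token left visible
            labels.append(IGNORE_INDEX)       # position not scored
    return masked_ids, labels


# Toy record (hypothetical): <s> ... </s> with five ordinary tokens.
input_ids = [0, 2444, 6997, 46162, 7744, 35, 2]
special_tokens_mask = [1, 0, 0, 0, 0, 0, 1]

masked_ids, labels = mask_tokens(input_ids, special_tokens_mask, mlm_probability=0.5, seed=0)
print(masked_ids)
print(labels)
```

For the DET, VERB, and SMALL configurations, a similar loop could select the positions where `tag_mask` is 1 instead of sampling at random; that variant is likewise an assumption about usage.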

Data Splits

  • train:
    • features: ['input_ids', 'special_tokens_mask']
    • num_rows: 12_005_390
  • test:
    • features: ['input_ids', 'special_tokens_mask']
    • num_rows: 37_112
  • validation:
    • features: ['input_ids', 'special_tokens_mask']
    • num_rows: 36_713
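
To put the split sizes in proportion, the short calculation below derives each split's share of the total from the row counts listed above. The variable names are illustrative only.

```python
# Row counts as listed in the split table above.
split_rows = {"train": 12_005_390, "test": 37_112, "validation": 36_713}

total_rows = sum(split_rows.values())
split_share = {name: rows / total_rows for name, rows in split_rows.items()}

print(f"total: {total_rows:,} rows")
for name, share in split_share.items():
    print(f"{name}: {split_rows[name]:,} rows ({share:.2%})")
```

The training split accounts for roughly 99.4% of all rows; the test and validation splits are each about 0.3%.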

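Dataset Creation

Curation Rationale

The dataset was assembled to train language models in the field of cell and molecular biology. To expand the size of the dataset and to include many examples with highly technical language, figure legends (or figure captions) were extracted from the articles in addition to the abstracts.

Source Data

Initial Data Collection and Normalization

In January 2021, the XML content of papers was downloaded from the open access section of Europe PMC. Figure legends and abstracts were extracted from the JATS XML, tokenized with the roberta-base tokenizer, and part-of-speech tagged with spaCy's en_core_web_sm model (https://spacy.io).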
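
The extraction step can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the soda-roberta implementation: the XML snippet is a made-up toy article, `element_text` and `extract_examples` are hypothetical helpers, and abstracts and legends are assumed to live in JATS `<abstract>` and `<fig>`/`<caption>` elements. Tokenization and part-of-speech tagging are not shown.

```python
import xml.etree.ElementTree as ET

# Toy JATS-like article (hypothetical content, for illustration only).
JATS_XML = """
<article>
  <front>
    <article-meta>
      <abstract><p>We compare the diagnostic performance of rapid tests.</p></abstract>
    </article-meta>
  </front>
  <body>
    <sec>
      <fig id="f1">
        <label>Figure 1</label>
        <caption><p>Sensitivity of <italic>each</italic> test.</p></caption>
      </fig>
    </sec>
  </body>
</article>
"""


def element_text(element):
    """Concatenate all text inside an element and normalize whitespace."""
    return " ".join("".join(element.itertext()).split())


def extract_examples(xml_string):
    """Return (abstracts, figure_legends) found in a JATS XML string."""
    root = ET.fromstring(xml_string.strip())
    abstracts = [element_text(el) for el in root.iter("abstract")]
    legends = []
    for fig in root.iter("fig"):
        caption = fig.find("caption")  # legend text sits in the figure's caption
        if caption is not None:
            legends.append(element_text(caption))
    return abstracts, legends


abstracts, legends = extract_examples(JATS_XML)
print(abstracts)
print(legends)
```

Using `itertext()` keeps text nested inside inline markup such as `<italic>`, which is common in figure legends.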

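More details are available at https://github.com/source-data/soda-roberta

Who are the source language producers?

Expert scientists.

Annotations

Annotation process

Part-of-speech tags were assigned automatically.

Who are the annotators?

Part-of-speech tagging was performed with spaCy's en_core_web_sm model (https://spacy.io).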

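Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

Thomas Lemberger

Licensing Information

CC-BY 4.0

Citation Information

[More Information Needed]

Contributions

Thanks to @tlemberger for adding this dataset.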