数据集:
EMBO/biolang
任务:
文本生成子任务:
language-modeling语言:
en计算机处理:
monolingual语言创建人:
expert-generated批注创建人:
machine-generated许可:
cc-by-4.0BioLang数据集基于EuropePubMed Central的开放获取部分的摘要,用于训练生物学领域的语言模型。该数据集可用于随机掩码语言建模,或仅使用特定的词性掩码语言建模。有关数据集生成和使用的更多详细信息,请参见 https://github.com/source-data/soda-roberta 。
英语
{ "input_ids":[ 0, 2444, 6997, 46162, 7744, 35, 20632, 20862, 3457, 36, 500, 23858, 29, 43, 32, 3919, 716, 15, 49, 4476, 4, 1398, 6, 52, 1118, 5, 20862, 819, 9, 430, 23305, 248, 23858, 29, 4, 256, 40086, 104, 35, 1927, 1069, 459, 1484, 58, 4776, 13, 23305, 634, 16706, 493, 2529, 8954, 14475, 73, 34263, 6, 4213, 718, 833, 12, 24291, 4473, 22500, 14475, 73, 510, 705, 73, 34263, 6, 5143, 4313, 2529, 8954, 14475, 73, 34263, 6, 8, 5143, 4313, 2529, 8954, 14475, 248, 23858, 29, 23, 4448, 225, 4722, 2392, 11, 9341, 261, 4, 49043, 35, 96, 746, 6, 5962, 9, 38415, 4776, 408, 36, 3897, 4, 398, 8871, 56, 23305, 4, 20, 15608, 21, 8061, 6164, 207, 13, 70, 248, 23858, 29, 6, 150, 5, 42561, 21, 8061, 5663, 207, 13, 80, 3457, 4, 509, 1296, 5129, 21567, 3457, 36, 398, 23528, 8748, 22065, 11654, 35, 7253, 15, 49, 4476, 6, 70, 3457, 4682, 65, 189, 28, 5131, 13, 23305, 9726, 4, 2 ], "label_ids": [ "X", "NOUN", "NOUN", "NOUN", "NOUN", "PUNCT", "ADJ", "ADJ", "NOUN", "PUNCT", "PROPN", "PROPN", "PROPN", "PUNCT", "AUX", "VERB", "VERB", "ADP", "DET", "NOUN", "PUNCT", "ADV", "PUNCT", "PRON", "VERB", "DET", "ADJ", "NOUN", "ADP", "ADJ", "NOUN", "NOUN", "NOUN", "NOUN", "PUNCT", "ADJ", "ADJ", "ADJ", "PUNCT", "NOUN", "NOUN", "NOUN", "NOUN", "AUX", "VERB", "ADP", "NOUN", "VERB", "PROPN", "PROPN", "PROPN", "PROPN", "PROPN", "SYM", "PROPN", "PUNCT", "PROPN", "PROPN", "PROPN", "PUNCT", "PROPN", "PROPN", "PROPN", "PROPN", "SYM", "PROPN", "PROPN", "SYM", "PROPN", "PUNCT", "PROPN", "PROPN", "PROPN", "PROPN", "PROPN", "SYM", "PROPN", "PUNCT", "CCONJ", "ADJ", "PROPN", "PROPN", "PROPN", "PROPN", "NOUN", "NOUN", "NOUN", "ADP", "PROPN", "PROPN", "PROPN", "PROPN", "ADP", "PROPN", "PROPN", "PUNCT", "PROPN", "PUNCT", "ADP", "NOUN", "PUNCT", "NUM", "ADP", "NUM", "VERB", "NOUN", "PUNCT", "NUM", "NUM", "NUM", "NOUN", "AUX", "NOUN", "PUNCT", "DET", "NOUN", "AUX", "X", "NUM", "NOUN", "ADP", "DET", "NOUN", "NOUN", "NOUN", "PUNCT", "SCONJ", "DET", "NOUN", "AUX", "X", "NUM", "NOUN", "ADP", "NUM", "NOUN", "PUNCT", "NUM", "NOUN", "VERB", "ADJ", "NOUN", "PUNCT", "NUM", "NOUN", "NOUN", "NOUN", "NOUN", "PUNCT", "VERB", "ADP", "DET", "NOUN", "PUNCT", "DET", "NOUN", "SCONJ", "PRON", "VERB", "AUX", "VERB", "ADP", "NOUN", "NOUN", "PUNCT", "X" ], "special_tokens_mask": [ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1 ] }
MLM:
DET,VERB,SMALL:
该数据集被组装起来用于训练细胞和分子生物学领域的语言模型。为了扩大数据集的规模并包含许多具有高度技术语言的示例,从文章的摘要中提取了图例(或图表说明)。
2021年1月,从 EuropePMC 的开放获取部分下载了论文的xml内容。从JATS XML中提取了图例和摘要,使用roberta-base的tokenizer进行了标记化,并使用Spacy的en_core_web_sm模型( https://spacy.io )进行了词性标记。
更多详细信息请参见 https://github.com/source-data/soda-roberta
谁是源语言制片商?专家科学家们。
词性是自动标记的。
谁是注释者?词性标记使用了Spacy的en_core_web_sm模型( https://spacy.io )。
[需要更多信息]
[需要更多信息]
[需要更多信息]
[需要更多信息]
Thomas Lemberger
CC-BY 4.0
[需要更多信息]
感谢 @tlemberger 添加了此数据集。