
ProtT5-XL-BFD model

Pretrained model on protein sequences using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository. This model is trained on uppercase amino acids: it only works with capital letter amino acids.

Model description

ProtT5-XL-BFD is based on the t5-3b model and was pretrained on a large corpus of protein sequences in a self-supervised fashion. This means it was pretrained on raw protein sequences only, with no humans labelling them in any way (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those protein sequences.

One important difference between this T5 model and the original T5 version is the denoising objective. The original T5-3B model was pretrained using a span denoising objective, while this model was pretrained with a Bart-like MLM denoising objective. The masking probability is consistent with the original T5 training: 15% of the amino acids in the input are randomly masked.

It has been shown that the features extracted from this self-supervised model (LM-embeddings) capture important biophysical properties governing protein shape. This implies that it has learned some of the grammar of the language of life realized in protein sequences.

Intended uses & limitations

The model can be used for protein feature extraction or fine-tuned on downstream tasks. We have noticed that on some tasks you can gain higher accuracy by fine-tuning the model rather than using it as a feature extractor. We have also noticed that for feature extraction, it is better to use the features extracted from the encoder rather than from the decoder.
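
As an illustration of the fine-tuning route, here is a minimal sketch (not part of the original card) of a per-residue classification head on top of the ProtT5 encoder, e.g. for 3-state secondary structure prediction. The class name and the number of labels are assumptions made for this example:

from torch import nn
from transformers import T5EncoderModel

# Hypothetical fine-tuning sketch: a linear per-residue classification head
# on top of the ProtT5 encoder (the decoder is not needed for such tasks).
class ProtT5TokenClassifier(nn.Module):
    def __init__(self, model_name="Rostlab/prot_t5_xl_bfd", num_labels=3):
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.encoder.config.d_model, num_labels)

    def forward(self, input_ids, attention_mask):
        # Per-residue encoder states: (batch, seq_len, d_model)
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return self.classifier(hidden)  # (batch, seq_len, num_labels)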

How to use

Here is how to use this model in PyTorch to extract the features of a given protein sequence:

from transformers import T5Tokenizer, T5Model
import re
import torch

tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_bfd', do_lower_case=False)
model = T5Model.from_pretrained("Rostlab/prot_t5_xl_bfd")

# Amino acids must be uppercase and separated by single spaces
sequences_Example = ["A E T C Z A O", "S K T Z P"]

# Map the rare amino acids U, Z, O, B to X
sequences_Example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_Example]

ids = tokenizer.batch_encode_plus(sequences_Example, add_special_tokens=True, padding=True)
input_ids = torch.tensor(ids['input_ids'])
attention_mask = torch.tensor(ids['attention_mask'])

with torch.no_grad():
    # T5Model also needs decoder inputs; feeding the input ids to the decoder
    # is sufficient when the model is only used for feature extraction
    embedding = model(input_ids=input_ids,
                      attention_mask=attention_mask,
                      decoder_input_ids=input_ids)

# For feature extraction we recommend using the encoder embedding
encoder_embedding = embedding.encoder_last_hidden_state.cpu().numpy()
decoder_embedding = embedding.last_hidden_state.cpu().numpy()
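
If a single fixed-size vector per protein is needed, one common option (not shown in the original card) is to average the per-residue encoder states while ignoring padded positions. The helper below is a minimal sketch reusing the variables from the block above; its name is hypothetical:

# Hypothetical pooling helper: average the per-residue encoder states of each
# sequence, ignoring padded positions, to obtain one vector per protein.
def mean_pool_per_protein(encoder_embedding, attention_mask):
    mask = attention_mask.cpu().numpy()[..., None]   # (batch, seq_len, 1)
    summed = (encoder_embedding * mask).sum(axis=1)  # sum over real residues
    counts = mask.sum(axis=1)                        # number of real tokens
    return summed / counts                           # (batch, hidden_size)

protein_vectors = mean_pool_per_protein(encoder_embedding, attention_mask)
print(protein_vectors.shape)  # e.g. (2, 1024) for the two example sequences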

Training data

The ProtT5-XL-BFD model was pretrained on BFD, a dataset consisting of 2.1 billion protein sequences.

Training procedure

Preprocessing

The protein sequences are uppercased and tokenized using a single space and a vocabulary size of 21. The rare amino acids "U, Z, O, B" were mapped to "X". The inputs of the model are then of the form:

Protein Sequence [EOS]

The preprocessing step was performed on the fly, by cutting and padding the protein sequences up to 512 tokens.
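
As a rough illustration of this preprocessing (not the authors' code), the sketch below uppercases a raw sequence, maps the rare amino acids to X, separates the residues with spaces, and crops to at most 512 tokens. The function name and the way one slot is reserved for EOS are assumptions:

import re

# Illustrative preprocessing sketch: uppercase, map rare amino acids to X,
# separate residues with spaces, and crop to at most 512 tokens (the tokenizer
# appends the EOS token itself, so one position is reserved for it).
def preprocess(sequence, max_length=512):
    sequence = re.sub(r"[UZOB]", "X", sequence.upper())
    residues = list(sequence)[: max_length - 1]  # reserve one slot for EOS
    return " ".join(residues)

print(preprocess("aetczao"))  # "A E T C X A X"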

The details of the masking procedure for each sequence are as follows (a rough code sketch is given after the list):

  • 15% of the amino acids are masked.
  • In 90% of the cases, the masked amino acids are replaced by the [MASK] token.
  • In 10% of the cases, the masked amino acids are replaced by a random amino acid different from the one they replace.
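
A sketch of this masking scheme, for illustration only and not the actual training code, could look as follows, with the "[MASK]" string standing in for the model's mask token:

import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues

# Illustrative masking sketch: each residue is masked with probability 0.15;
# a masked residue becomes [MASK] 90% of the time and a different random
# amino acid 10% of the time, as described in the list above.
def mask_sequence(residues, mask_prob=0.15, mask_token="[MASK]"):
    corrupted = []
    for aa in residues:
        if random.random() >= mask_prob:
            corrupted.append(aa)                       # left unchanged
        elif random.random() < 0.9:
            corrupted.append(mask_token)               # replaced by [MASK]
        else:
            corrupted.append(random.choice(
                [a for a in AMINO_ACIDS if a != aa]))  # random different residue
    return corrupted

print(mask_sequence(list("AETCGAK")))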

Pretraining

The model was trained on a single TPU Pod V3-1024 for a total of 1.2 million steps, using a sequence length of 512 (batch size 4k). It has a total of approximately 3 billion parameters and was pretrained using an encoder-decoder architecture. The optimizer used was AdaFactor with an inverse square root learning rate schedule for pretraining.
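
For illustration, a minimal PyTorch sketch of such an optimization setup (not the original TPU training code) could combine transformers' Adafactor with a manually applied inverse square root schedule. The PEAK_LR and WARMUP_STEPS values are assumptions for this example, not the published hyperparameters:

import torch
from transformers import Adafactor

PEAK_LR = 1e-2        # assumed value, for illustration only
WARMUP_STEPS = 10_000  # assumed value, for illustration only

def inverse_sqrt_lr(step):
    # Linear warmup followed by an inverse square root decay.
    step = max(step, 1)
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * (WARMUP_STEPS / step) ** 0.5

params = [torch.nn.Parameter(torch.zeros(10))]  # placeholder parameters
optimizer = Adafactor(params, lr=inverse_sqrt_lr(1),
                      relative_step=False, scale_parameter=False)

for step in range(1, 5):
    for group in optimizer.param_groups:
        group["lr"] = inverse_sqrt_lr(step)  # update the learning rate
    # ... forward pass, loss.backward() and optimizer.step() go here ...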

Evaluation results

When used for feature extraction, the model achieves the following results:

Test results:

Task/Dataset   secondary structure (3-states)   secondary structure (8-states)   Localization   Membrane
CASP12         77                               66                               -              -
TS115          85                               74                               -              -
CB513          84                               71                               -              -
DeepLoc        -                                -                                77             91

BibTeX entry and citation info

@article {Elnaggar2020.07.12.199554,
    author = {Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Wang, Yu and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and BHOWMIK, DEBSINDHU and Rost, Burkhard},
    title = {ProtTrans: Towards Cracking the Language of Life{\textquoteright}s Code Through Self-Supervised Deep Learning and High Performance Computing},
    elocation-id = {2020.07.12.199554},
    year = {2020},
    doi = {10.1101/2020.07.12.199554},
    publisher = {Cold Spring Harbor Laboratory},
    abstract = {Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive language models (Transformer-XL, XLNet) and two auto-encoder models (Bert, Albert) on data from UniRef and BFD containing up to 393 billion amino acids (words) from 2.1 billion protein sequences (22- and 112 times the entire English Wikipedia). The LMs were trained on the Summit supercomputer at Oak Ridge National Laboratory (ORNL), using 936 nodes (total 5616 GPUs) and one TPU Pod (V3-512 or V3-1024). We validated the advantage of up-scaling LMs to larger models supported by bigger data by predicting secondary structure (3-states: Q3=76-84, 8 states: Q8=65-73), sub-cellular localization for 10 cellular compartments (Q10=74) and whether a protein is membrane-bound or water-soluble (Q2=89). Dimensionality reduction revealed that the LM-embeddings from unlabeled data (only protein sequences) captured important biophysical properties governing protein shape. This implied learning some of the grammar of the language of life realized in protein sequences. The successful up-scaling of protein LMs through HPC to larger data sets slightly reduced the gap between models trained on evolutionary information and LMs. Availability ProtTrans: \<a href="https://github.com/agemagician/ProtTrans"\>https://github.com/agemagician/ProtTrans\</a\>Competing Interest StatementThe authors have declared no competing interest.},
    URL = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554},
    eprint = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554.full.pdf},
    journal = {bioRxiv}
}

Created by Ahmed Elnaggar/@Elnaggar_AI | LinkedIn