ProtT5-XL-UniRef50模型

该模型是使用掩码语言建模（MLM）目标在蛋白质序列上进行预训练的。它是在 this paper 中被介绍并首次发布于 this repository 。该模型仅接受大写氨基酸：它只能处理大写字母的氨基酸。

模型描述

ProtT5-XL-UniRef50基于t5-3b模型，并在大规模蛋白质序列语料库上进行了自监督方式的预训练。这意味着它仅在原始蛋白质序列上进行了预训练，而没有以任何方式人工标记它们（这就是为什么它可以使用大量的公开数据），预训练过程中使用自动生成的输入和标签。

该T5模型与原始T5版本之间的一个重要区别是去噪目标。原始的T5-3B模型是使用区间去噪目标进行预训练的，而这个模型是使用类似Bart的MLM去噪目标进行预训练的。掩码概率与原始T5的训练一致，即在输入中随机掩盖15%的氨基酸。

已经证明，从这个自监督模型（LM-嵌入）提取的特征捕捉到了决定蛋白质结构形状的重要生物物理特性。这意味着它学习到了生命语言中的一些语法，体现在蛋白质序列中。

使用目的和限制

该模型可用于蛋白质特征提取或在后续任务上进行微调。我们注意到在某些任务中，通过微调模型而不是将其作为特征提取器使用，可以获得更高的准确性。我们还注意到，对于特征提取，最好使用从编码器中提取的特征而不是解码器。

使用方法

以下是如何在PyTorch中使用该模型提取给定蛋白质序列的特征：

sequence_examples = ["PRTEINO", "SEQWENCE"]
# this will replace all rare/ambiguous amino acids by X and introduce white-space between all amino acids
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]

# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(input_ids=input_ids,attention_mask=attention_mask)

# extract embeddings for the first ([0,:]) sequence in the batch while removing padded & special tokens ([0,:7]) 
emb_0 = embedding_repr.last_hidden_state[0,:7] # shape (7 x 1024)
print(f"Shape of per-residue embedding of first sequences: {emb_0.shape}")
# do the same for the second ([1,:]) sequence in the batch while taking into account different sequence lengths ([1,:8])
emb_1 = embedding_repr.last_hidden_state[1,:8] # shape (8 x 1024)

# if you want to derive a single representation (per-protein embedding) for the whole protein
emb_0_per_protein = emb_0.mean(dim=0) # shape (1024)

print(f"Shape of per-protein embedding of first sequences: {emb_0_per_protein.shape}")

训练数据

ProtT5-XL-UniRef50模型是在 UniRef50 上进行的预训练，该数据集包含4500万个蛋白质序列。

训练过程

预处理

蛋白质序列被转换为大写并使用单个空格进行分词，并使用21个词汇量。稀有氨基酸"U,Z,O,B"被映射为"X"。然后，模型的输入形式如下：

Protein Sequence [EOS]

预处理步骤是动态进行的，通过将蛋白质序列截断和填充至512个标记。

每个序列的掩码过程的详细信息如下：

15%的氨基酸被掩盖。
在90%的情况下，被掩盖的氨基酸被替换为[MASK]标记。
在10%的情况下，被掩盖的氨基酸被随机（不同的）氨基酸替换。

预训练

模型在单个TPU Pod V2-256上进行了总共991.5k步的训练，使用序列长度为512（批大小2k）。预训练使用ProtT5-XL-BFD模型作为初始检查点进行，而不是从头开始训练。模型总共约有30亿个参数，并使用编码器-解码器架构进行训练。优化器使用AdaFactor，并采用反平方根学习率调度进行预训练。

评估结果

当该模型用于特征提取时，该模型实现了以下结果：

测试结果：

Task/Dataset	secondary structure (3-states)	secondary structure (8-states)
CASP12	81	70
TS115	87	77
CB513	86	74
DeepLoc	81	91

BibTeX引用和引用信息

@article {Elnaggar2020.07.12.199554,
    author = {Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Wang, Yu and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and BHOWMIK, DEBSINDHU and Rost, Burkhard},
    title = {ProtTrans: Towards Cracking the Language of Life{\textquoteright}s Code Through Self-Supervised Deep Learning and High Performance Computing},
    elocation-id = {2020.07.12.199554},
    year = {2020},
    doi = {10.1101/2020.07.12.199554},
    publisher = {Cold Spring Harbor Laboratory},
    abstract = {Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive language models (Transformer-XL, XLNet) and two auto-encoder models (Bert, Albert) on data from UniRef and BFD containing up to 393 billion amino acids (words) from 2.1 billion protein sequences (22- and 112 times the entire English Wikipedia). The LMs were trained on the Summit supercomputer at Oak Ridge National Laboratory (ORNL), using 936 nodes (total 5616 GPUs) and one TPU Pod (V3-512 or V3-1024). We validated the advantage of up-scaling LMs to larger models supported by bigger data by predicting secondary structure (3-states: Q3=76-84, 8 states: Q8=65-73), sub-cellular localization for 10 cellular compartments (Q10=74) and whether a protein is membrane-bound or water-soluble (Q2=89). Dimensionality reduction revealed that the LM-embeddings from unlabeled data (only protein sequences) captured important biophysical properties governing protein shape. This implied learning some of the grammar of the language of life realized in protein sequences. The successful up-scaling of protein LMs through HPC to larger data sets slightly reduced the gap between models trained on evolutionary information and LMs. Availability ProtTrans: \&lt;a href="https://github.com/agemagician/ProtTrans"\&gt;https://github.com/agemagician/ProtTrans\&lt;/a\&gt;Competing Interest StatementThe authors have declared no competing interest.},
    URL = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554},
    eprint = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554.full.pdf},
    journal = {bioRxiv}
}

创建者： Ahmed Elnaggar/@Elnaggar_AI | LinkedIn

作者:

Rostlab

数据集大小:

31.5 GB