
ProtT5-XL-UniRef50, encoder-only, half-precision model

This is the encoder-only, half-precision version of the ProtT5-XL-UniRef50 model. The original model and its pre-training were introduced in this paper and first released in this repository. The model was trained on uppercase amino acids: it works only with capital-letter amino acids.

Model description

ProtT5-XL-UniRef50 is based on the t5-3b model and was pre-trained on a large corpus of protein sequences in a self-supervised fashion. This means it was pre-trained on raw protein sequences only, with no human labelling of any kind (which is why it can make use of so much publicly available data), using an automated process to generate inputs and labels from those sequences.

One important difference between this T5 model and the original T5 version is the denoising objective. The original T5-3B model was pre-trained with a span denoising objective, while this model was pre-trained with a Bart-like MLM denoising objective. The masking probability is consistent with the original T5 training: 15% of the amino acids in the input are randomly masked.
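As a rough illustration of this objective, the sketch below randomly replaces 15% of the amino acids in a sequence with a mask token. It is a toy reconstruction added to this card, not the actual pre-training pipeline; the mask token "<extra_id_0>" and the helper name mask_residues are assumptions made for the example.

import random

def mask_residues(residues, mask_token="<extra_id_0>", prob=0.15):
    # independently replace each amino acid with the mask token with probability 0.15
    return [mask_token if random.random() < prob else aa for aa in residues]

print(" ".join(mask_residues(list("SEQWENCE"))))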

This model contains only the encoder portion of the original ProtT5-XL-UniRef50 model, in half precision (float16). As such, it can be used efficiently to create protein/amino-acid representations. When used for training downstream networks or for feature extraction, these embeddings yield the same performance (established by comparisons on several downstream tasks).

Intended uses and limitations

This version of the original ProtT5-XL-UniRef50 is mainly meant for conveniently creating amino-acid or protein embeddings with a lower GPU memory footprint, without a noticeable drop in performance in our experiments. The model is fully usable on a GPU with 8 GB of video memory.
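As a rough way to check the memory footprint on your own hardware, the sketch below (an illustrative addition, not part of the original card) loads the half-precision encoder onto a CUDA device and prints the GPU memory allocated for its weights; activations during inference come on top of this figure.

import torch
from transformers import T5EncoderModel

# load the encoder-only model in half precision directly onto the GPU
model = T5EncoderModel.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc',
                                       torch_dtype=torch.float16).to('cuda:0')

# memory currently allocated by the model weights
print(f"Allocated by weights: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")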

How to use

Detailed, interactive examples of how to use this model for common tasks can be found on Google Colab.

Here is how to extract the features of a given protein sequence in PyTorch:

import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# load the tokenizer and the encoder-only model in half precision
# (see the notes below; on CPU you have to convert back to full precision)
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', do_lower_case=False)
model = T5EncoderModel.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', torch_dtype=torch.float16).to(device)
model = model.eval()

# prepare your protein sequences as a list
sequence_examples = ["PRTEINO", "SEQWENCE"]
# this will replace all rare/ambiguous amino acids by X and introduce white-space between all amino acids
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]

# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)

# extract embeddings for the first ([0,:]) sequence in the batch while removing padded & special tokens ([0,:7]) 
emb_0 = embedding_repr.last_hidden_state[0,:7] # shape (7 x 1024)
print(f"Shape of per-residue embedding of first sequences: {emb_0.shape}")
# do the same for the second ([1,:]) sequence in the batch while taking into account different sequence lengths ([1,:8])
emb_1 = embedding_repr.last_hidden_state[1,:8] # shape (8 x 1024)

# if you want to derive a single representation (per-protein embedding) for the whole protein
emb_0_per_protein = emb_0.mean(dim=0) # shape (1024)

print(f"Shape of per-protein embedding of first sequences: {emb_0_per_protein.shape}")

Note: please make sure to explicitly set the model to float16 (T5EncoderModel.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', torch_dtype=torch.float16)); otherwise the generated embeddings will be full precision.

Note: currently (June 2022) the half-precision model cannot be used on CPU. If you want to use the encoder-only version on CPU, you need to convert it to its full-precision version (model=model.float()).
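The two notes above can be combined as in the following sketch, which loads the encoder explicitly in float16, falls back to full precision when no GPU is available, and prints the resulting parameter dtype as a sanity check. This is a suggested pattern rather than code from the original card.

import torch
from transformers import T5EncoderModel

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

model = T5EncoderModel.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc',
                                       torch_dtype=torch.float16)
if device.type == 'cpu':
    model = model.float()  # half precision is not supported on CPU (as of June 2022)
model = model.to(device).eval()

print(model.dtype)  # torch.float16 on GPU, torch.float32 on CPU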

BibTeX entry and citation info

@article {Elnaggar2020.07.12.199554,
    author = {Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Wang, Yu and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and BHOWMIK, DEBSINDHU and Rost, Burkhard},
    title = {ProtTrans: Towards Cracking the Language of Life{\textquoteright}s Code Through Self-Supervised Deep Learning and High Performance Computing},
    elocation-id = {2020.07.12.199554},
    year = {2020},
    doi = {10.1101/2020.07.12.199554},
    publisher = {Cold Spring Harbor Laboratory},
    abstract = {Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive language models (Transformer-XL, XLNet) and two auto-encoder models (Bert, Albert) on data from UniRef and BFD containing up to 393 billion amino acids (words) from 2.1 billion protein sequences (22- and 112 times the entire English Wikipedia). The LMs were trained on the Summit supercomputer at Oak Ridge National Laboratory (ORNL), using 936 nodes (total 5616 GPUs) and one TPU Pod (V3-512 or V3-1024). We validated the advantage of up-scaling LMs to larger models supported by bigger data by predicting secondary structure (3-states: Q3=76-84, 8 states: Q8=65-73), sub-cellular localization for 10 cellular compartments (Q10=74) and whether a protein is membrane-bound or water-soluble (Q2=89). Dimensionality reduction revealed that the LM-embeddings from unlabeled data (only protein sequences) captured important biophysical properties governing protein shape. This implied learning some of the grammar of the language of life realized in protein sequences. The successful up-scaling of protein LMs through HPC to larger data sets slightly reduced the gap between models trained on evolutionary information and LMs. Availability ProtTrans: \<a href="https://github.com/agemagician/ProtTrans"\>https://github.com/agemagician/ProtTrans\</a\>Competing Interest StatementThe authors have declared no competing interest.},
    URL = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554},
    eprint = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554.full.pdf},
    journal = {bioRxiv}
}