nucleotide-transformer-2.5b-multi-species model

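The Nucleotide Transformers are a collection of foundation language models pre-trained on DNA sequences from whole genomes. Unlike approaches that rely on a single reference genome, our models leverage DNA sequences from over 3,200 diverse human genomes, as well as 850 genomes from a wide range of species, including model and non-model organisms. Through robust and extensive evaluation, we show that these large models provide highly accurate molecular phenotype prediction compared to existing methods.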

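Part of this collection is nucleotide-transformer-2.5b-multi-species, a 2.5B-parameter transformer pre-trained on a collection of 850 genomes from a wide range of species, including model and non-model organisms. The model is available in both Tensorflow and Pytorch.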

Developed by: InstaDeep, NVIDIA and TUM

Model Sources

Repository: Nucleotide Transformer
Paper: The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics

How to use

Until its next release, the transformers library needs to be installed from source with the following command in order to use the models:

pip install --upgrade git+https://github.com/huggingface/transformers.git

The short snippet below retrieves both logits and embeddings from a dummy DNA sequence.

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Import the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-2.5b-multi-species")
model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/nucleotide-transformer-2.5b-multi-species")

# Create a dummy dna sequence and tokenize it
sequences = ['ATTCTG' * 9]
tokens_ids = tokenizer.batch_encode_plus(sequences, return_tensors="pt")["input_ids"]

# Compute the embeddings
attention_mask = tokens_ids != tokenizer.pad_token_id
torch_outs = model(
    tokens_ids,
    attention_mask=attention_mask,
    encoder_attention_mask=attention_mask,
    output_hidden_states=True
)

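# Retrieve the masked-language-model logits
logits = torch_outs['logits'].detach()
print(f"Logits shape: {logits.shape}")

# Compute sequence embeddings from the last hidden layer (kept as a torch tensor)
embeddings = torch_outs['hidden_states'][-1].detach()
print(f"Embeddings shape: {embeddings.shape}")
print(f"Embeddings per token: {embeddings.numpy()}")

# Compute mean embeddings per sequence, averaging over non-padding tokens only.
# The mask keeps a trailing dimension so the division also broadcasts for batches > 1.
mask = attention_mask.unsqueeze(-1)
mean_sequence_embeddings = torch.sum(mask * embeddings, dim=-2) / torch.sum(mask, dim=-2)
print(f"Mean sequence embeddings: {mean_sequence_embeddings}")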

Training data

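The nucleotide-transformer-2.5b-multi-species model was pre-trained on a total of 850 genomes downloaded from NCBI. Plants and viruses are not included in these genomes, because their regulatory elements differ from those of interest in the paper's tasks. Some heavily studied model organisms were selected for inclusion in the collection, which represents a total of 174B nucleotides, i.e. roughly 29B tokens (about six nucleotides per token). The data has been released as a HuggingFace dataset.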

Training procedure

Preprocessing

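The DNA sequences are tokenized using the Nucleotide Transformer Tokenizer, which splits sequences into 6-mer tokens when possible and otherwise tokenizes each nucleotide separately, as described in the Tokenization section of the associated repository. This tokenizer has a vocabulary size of 4105. The inputs of the model are then of the form: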

<CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>

The tokenized sequences have a maximum length of 1,000 tokens.
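The tokenization rule above can be illustrated with a minimal pure-Python sketch. This is not the library's tokenizer: the function name `kmer_tokenize`, the greedy left-to-right scan, the handling of non-ACGT characters such as `N`, and the choice to count `<CLS>` toward the 1,000-token cap are all assumptions made for illustration. For context on the vocabulary size, 4^6 = 4096 possible 6-mers account for most of the 4105 entries, leaving 9 for single nucleotides and special tokens.

```python
NUCLEOTIDES = set("ACGT")


def kmer_tokenize(sequence, k=6, max_tokens=1000):
    """Greedy left-to-right k-mer tokenization (illustrative sketch only)."""
    tokens = ["<CLS>"]  # assumption: the cap of max_tokens includes <CLS>
    i = 0
    while i < len(sequence) and len(tokens) < max_tokens:
        window = sequence[i:i + k]
        if len(window) == k and set(window) <= NUCLEOTIDES:
            # A full window of A/C/G/T becomes one k-mer token.
            tokens.append(f"<{window}>")
            i += k
        else:
            # Otherwise fall back to a single-nucleotide token.
            tokens.append(f"<{sequence[i]}>")
            i += 1
    return tokens


print(kmer_tokenize("ACGTGTACGTGCACGGAC"))
print(kmer_tokenize("ATCCCGGNNTCGACACN"))
```

Under these assumptions, a sequence whose length is a multiple of 6 and contains only A, C, G and T yields one token per 6 nucleotides plus `<CLS>`, while any leftover tail or ambiguous base is emitted one nucleotide at a time.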

The masking procedure used is the standard one for Bert-style training:

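15% of the tokens are masked.
In 80% of the cases, the masked tokens are replaced by [MASK].
In 10% of the cases, the masked tokens are replaced by a random token different from the one they replace.
In the remaining 10% of the cases, the masked tokens are left unchanged.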
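The masking rules listed above can be sketched in plain Python as follows. This is an illustrative approximation, not the training code: the function name `bert_mask`, the per-token Bernoulli selection (rather than selecting exactly 15% of positions), the `-100` ignore label (a PyTorch loss convention), and the token ids used in the usage lines are all hypothetical choices.

```python
import random

IGNORE_INDEX = -100  # assumption: label for positions excluded from the loss


def bert_mask(token_ids, vocab_size, mask_id, special_ids=(), mask_prob=0.15, rng=None):
    """Return (inputs, labels) after Bert-style masking of a list of token ids."""
    rng = rng or random.Random()
    forbidden = set(special_ids) | {mask_id}
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for pos, tok in enumerate(token_ids):
        # Never select special tokens; select each other token with probability 15%.
        if tok in forbidden or rng.random() >= mask_prob:
            continue
        labels[pos] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[pos] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            new = tok  # 10%: replace with a random, different, non-special token
            while new == tok or new in forbidden:
                new = rng.randrange(vocab_size)
            inputs[pos] = new
        # remaining 10%: keep the original token unchanged
    return inputs, labels


# Hypothetical ids: 3 stands in for <CLS>, 2 for the mask token.
example_ids = [3, 120, 871, 45, 2999, 4000, 17, 256]
masked_inputs, labels = bert_mask(
    example_ids, vocab_size=4105, mask_id=2, special_ids={0, 1, 2, 3}, rng=random.Random(0)
)
print(masked_inputs)
print(labels)
```

Keeping 10% of the selected tokens unchanged and randomizing another 10% means the model cannot assume that every non-masked input token is correct, which is the usual motivation for this scheme.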

Pretraining

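The model was trained on 128 A100 80GB GPUs over 300B tokens, with an effective batch size of 1M tokens. The sequence length used was 1,000 tokens. The Adam optimizer [38] was used with a learning rate schedule and standard values for the exponential decay rates and epsilon constant: β1 = 0.9, β2 = 0.999 and ε = 1e-8. During an initial warmup period, the learning rate was increased linearly from 5e-5 to 1e-4 over 16k steps, and then decreased following a square root decay until the end of training.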
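The schedule described above can be sketched as a small function. The warmup endpoints and step count come from the text, but the exact form of the square root decay is not given; the sketch assumes the common inverse-square-root form, lr = peak * sqrt(warmup_steps / step), which is continuous with the warmup at step 16k. The names `learning_rate`, `WARMUP_STEPS`, `INITIAL_LR` and `PEAK_LR` are illustrative, and the step count treats "1M tokens" as exactly 1,000,000.

```python
import math

WARMUP_STEPS = 16_000
INITIAL_LR = 5e-5
PEAK_LR = 1e-4


def learning_rate(step):
    """Linear warmup followed by an (assumed) inverse-square-root decay."""
    if step < WARMUP_STEPS:
        # Linear interpolation from INITIAL_LR to PEAK_LR over the warmup.
        return INITIAL_LR + (PEAK_LR - INITIAL_LR) * step / WARMUP_STEPS
    # Assumed decay form; it equals PEAK_LR exactly at step == WARMUP_STEPS.
    return PEAK_LR * math.sqrt(WARMUP_STEPS / step)


# 300B training tokens at roughly 1M tokens per batch gives about 300k optimizer steps.
total_steps = 300_000_000_000 // 1_000_000

for step in (0, 8_000, 16_000, 64_000, total_steps):
    print(step, learning_rate(step))
```

With these assumptions the warmup covers only about the first 5% of training, and the learning rate falls back to its initial value of 5e-5 at step 64k.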

BibTeX entry and citation info

@article{dalla2023nucleotide,
  title={The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics},
  author={Dalla-Torre, Hugo and Gonzalez, Liam and Mendoza Revilla, Javier and Lopez Carranza, Nicolas and Henryk Grywaczewski, Adam and Oteri, Francesco and Dallago, Christian and Trop, Evan and Sirelkhatim, Hassan and Richard, Guillaume and others},
  journal={bioRxiv},
  pages={2023--01},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}

Author:

InstaDeep Ltd

Dataset size:

18.98 GB