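In this article, I will try to build an LLM with only 2.3 million parameters, and the interesting part is that we do not need a high-end GPU. We will follow the approach of the LLaMA 1 paper as our guide. We will keep things simple and use a basic dataset, so you can see how easy it is to create your own million-parameter LLM.
Before building our own LLM with the LLaMA approach, it is essential to understand the LLaMA architecture. Below is a comparison diagram of the vanilla transformer and LLaMA.
LLaMA normalizes the input of each transformer sub-layer rather than its output (pre-normalization, an idea inspired by GPT-3) and uses RMSNorm as the normalizing function. RMSNorm is designed to cut the computational cost of Layer Normalization: it offers comparable performance while reducing running time by 7%∼64%.
Beyond these concepts, the LLaMA paper describes other important methods: the AdamW optimizer with specific hyperparameters, efficient implementations such as the causal multi-head attention operator available in the xformers library, and a manually implemented backward function for the transformer layers that optimizes computation during the backward pass.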
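For reference, the difference between the two normalizations mentioned above can be written out for a single vector. This is a sketch of the standard definitions; the symbols $g$ (learned gain), $b$ (learned bias), $\epsilon$ (small constant) and $d$ (vector length) are notation introduced here, not taken from the article.

```latex
% Layer Normalization: re-center and re-scale
\mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad
\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2, \qquad
\mathrm{LayerNorm}(x)_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}\, g_i + b_i

% RMSNorm: re-scale only, no mean subtraction and no bias
\mathrm{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2}, \qquad
\mathrm{RMSNorm}(x)_i = \frac{x_i}{\mathrm{RMS}(x)}\, g_i
```

Dropping the mean and the bias is where RMSNorm saves computation.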
# PyTorch for implementing LLM (No GPU)
import torch
# Neural network modules and functions from PyTorch
from torch import nn
from torch.nn import functional as F
# NumPy for numerical operations
import numpy as np
# Matplotlib for plotting Loss etc.
from matplotlib import pyplot as plt
# Time module for tracking execution time
import time
# Pandas for data manipulation and analysis
import pandas as pd
# urllib for handling URL requests (Downloading Dataset)
import urllib.request
# Configuration object for model parameters
MASTER_CONFIG = {
    # Parameters are added later
}
# The URL of the raw text file on GitHub
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
# The file name for local storage
file_name = "tinyshakespeare.txt"
# Execute the download
urllib.request.urlretrieve(url, file_name)
# Read the content of the dataset
lines = open("tinyshakespeare.txt", 'r').read()
# Create a sorted list of unique characters in the dataset
vocab = sorted(list(set(lines)))
# Display the first 10 characters in the vocabulary list
print('Printing the first 10 characters of the vocab list:', vocab[:10])
# Output the total number of characters in our dataset (Vocabulary Size)
print('Total number of characters in our dataset (Vocabulary Size):', len(vocab))
# Mapping integers to characters (itos)
itos = {i: ch for i, ch in enumerate(vocab)}
# Mapping characters to integers (stoi)
stoi = {ch: i for i, ch in enumerate(vocab)}
# Encode function: Converts a string to a list of integers using the mapping stoi
def encode(s):
    return [stoi[ch] for ch in s]
# Decode function: Converts a list of integers back to a string using the mapping itos
def decode(l):
    return ''.join([itos[i] for i in l])
# Example: Encode the string "hello" and then decode the result
print(decode(encode("hello")))
# Convert the dataset into a torch tensor with specified data type (dtype)
dataset = torch.tensor(encode(lines), dtype=torch.int8)
# Display the shape of the resulting tensor
print(dataset.shape)
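To see the character-level tokenizer at work without downloading anything, here is a minimal sketch on a toy string, in plain Python. The corpus string and the `toy_` names are made up for illustration, so they do not overwrite the article's own `vocab`, `stoi` and `itos`.

```python
# Toy corpus standing in for the Shakespeare text (hypothetical example string)
toy_text = "to be or not to be"

# Vocabulary: sorted unique characters, built the same way as in the article
toy_vocab = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_vocab)}
toy_itos = {i: ch for i, ch in enumerate(toy_vocab)}

def toy_encode(s):
    """Map each character to its integer id."""
    return [toy_stoi[ch] for ch in s]

def toy_decode(ids):
    """Map integer ids back to characters."""
    return "".join(toy_itos[i] for i in ids)

ids = toy_encode("not to be")
print(toy_vocab)
print(ids)
print(toy_decode(ids))
```

A character that never appears in the corpus has no id, so encoding it raises a `KeyError`. That is a limitation of this simple scheme.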
# Function to get batches for training, validation, or testing
def get_batches(data, split, batch_size, context_window, config=MASTER_CONFIG):
    # Split the dataset into training, validation, and test sets
    train = data[:int(.8 * len(data))]
    val = data[int(.8 * len(data)): int(.9 * len(data))]
    test = data[int(.9 * len(data)):]
    # Determine which split to use
    batch_data = train
    if split == 'val':
        batch_data = val
    if split == 'test':
        batch_data = test
    # Pick random starting points within the data
    ix = torch.randint(0, batch_data.size(0) - context_window - 1, (batch_size,))
    # Create input sequences (x) and corresponding target sequences (y)
    x = torch.stack([batch_data[i:i+context_window] for i in ix]).long()
    y = torch.stack([batch_data[i+1:i+context_window+1] for i in ix]).long()
    return x, y
# Update the MASTER_CONFIG with batch_size and context_window parameters
MASTER_CONFIG.update({
    'batch_size': 8,       # Number of sequences sampled at random for each batch
    'context_window': 16   # Number of characters in each input (x) and target (y) sequence
})
batch_size sets how many sequences are drawn at random for each batch, while context_window sets the number of characters in each input (x) and target (y) sequence.
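The relationship between x and y is easiest to see on a short string. The sketch below mirrors the sampling logic in plain Python; `toy_get_batch`, the sample sentence and the seed are illustrative choices, not part of the article's code.

```python
import random

def toy_get_batch(data, batch_size, context_window, rng):
    """Sample `batch_size` windows; each y is its x shifted one character to the right."""
    xs, ys = [], []
    for _ in range(batch_size):
        # The exclusive upper bound leaves room for the shifted target window
        i = rng.randrange(0, len(data) - context_window - 1)
        xs.append(data[i : i + context_window])
        ys.append(data[i + 1 : i + context_window + 1])
    return xs, ys

toy_data = list("First Citizen: Before we proceed")
toy_xs, toy_ys = toy_get_batch(toy_data, batch_size=2, context_window=8, rng=random.Random(0))
for x_chars, y_chars in zip(toy_xs, toy_ys):
    print(repr("".join(x_chars)), "->", repr("".join(y_chars)))
```

At every position the target is simply the next character of the input, which is what the model is trained to predict.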
# Obtain batches for training using the specified batch size and context window
xs, ys = get_batches(dataset, 'train', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])
# Decode the sequences to obtain the corresponding text representations
decoded_samples = [(decode(xs[i].tolist()), decode(ys[i].tolist())) for i in range(len(xs))]
# Print the first random sample
print(decoded_samples[:1])
@torch.no_grad()  # Don't compute gradients for this function
def evaluate_loss(model, config=MASTER_CONFIG):
    # Placeholder for the evaluation results
    out = {}
    # Set the model to evaluation mode
    model.eval()
    # Iterate through training and validation splits
    for split in ["train", "val"]:
        # Placeholder for individual losses
        losses = []
        # Generate 10 batches for evaluation
        for _ in range(10):
            # Get input sequences (xb) and target sequences (yb)
            xb, yb = get_batches(dataset, split, config['batch_size'], config['context_window'])
            # Perform model inference and calculate the loss
            _, loss = model(xb, yb)
            # Append the loss to the list
            losses.append(loss.item())
        # Calculate the mean loss for the split and store it in the output dictionary
        out[split] = np.mean(losses)
    # Set the model back to training mode
    model.train()
    return out
# Definition of a basic neural network class
class SimpleBrokenModel(nn.Module):
    def __init__(self, config=MASTER_CONFIG):
        super().__init__()
        self.config = config
        # Embedding layer to convert character indices to vectors (vocab size: 65)
        self.embedding = nn.Embedding(config['vocab_size'], config['d_model'])
        # Linear layers for modeling relationships between features
        # (to be updated with SwiGLU activation function as in LLaMA)
        self.linear = nn.Sequential(
            nn.Linear(config['d_model'], config['d_model']),
            nn.ReLU(),  # Currently using ReLU, will be replaced with SwiGLU as in LLaMA
            nn.Linear(config['d_model'], config['vocab_size']),
        )
        # Print the total number of model parameters
        print("Model parameters:", sum([m.numel() for m in self.parameters()]))
# Definition of a basic neural network class
class SimpleBrokenModel(nn.Module):
    def __init__(self, config=MASTER_CONFIG):
        # Rest of the code
    # Forward pass function for the base model
    def forward(self, idx, targets=None):
        # Embedding layer converts character indices to vectors
        x = self.embedding(idx)
        # Linear layers for modeling relationships between features
        a = self.linear(x)
        # Apply softmax activation to obtain probability distribution
        logits = F.softmax(a, dim=-1)
        # If targets are provided, calculate and return the cross-entropy loss
        if targets is not None:
            # Reshape logits and targets for cross-entropy calculation
            loss = F.cross_entropy(logits.view(-1, self.config['vocab_size']), targets.view(-1))
            return logits, loss
        # If targets are not provided, return the logits
        return logits
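# Update MASTER_CONFIG with the dimension of linear layers (128)
MASTER_CONFIG.update({
    'd_model': 128,
    'vocab_size': len(vocab),  # Size of the character vocabulary, read by the embedding and output layers
})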
# Instantiate the SimpleBrokenModel using the updated MASTER_CONFIG
model = SimpleBrokenModel(MASTER_CONFIG)
# Print the total number of parameters in the model
print("Total number of parameters in the Simple Neural Network Model:", sum([m.numel() for m in model.parameters()]))
# Obtain batches for training using the specified batch size and context window
xs, ys = get_batches(dataset, 'train', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])
# Calculate logits and loss using the model
logits, loss = model(xs, ys)
# Update MASTER_CONFIG with training parameters
MASTER_CONFIG.update({
    'epochs': 1000,        # Number of training epochs (one random batch per epoch)
    'log_interval': 10,    # Log information every 10 batches during training
    'batch_size': 32,      # Increase batch size to 32
})
# Instantiate the SimpleBrokenModel with updated configuration
model = SimpleBrokenModel(MASTER_CONFIG)
# Define the Adam optimizer for model parameters
optimizer = torch.optim.Adam(
    model.parameters(),  # Pass the model parameters to the optimizer
)
# Function to perform training
def train(model, optimizer, scheduler=None, config=MASTER_CONFIG, print_logs=False):
    # Placeholder for storing losses
    losses = []
    # Start tracking time
    start_time = time.time()
    # Iterate through epochs
    for epoch in range(config['epochs']):
        # Zero out gradients
        optimizer.zero_grad()
        # Obtain batches for training
        xs, ys = get_batches(dataset, 'train', config['batch_size'], config['context_window'])
        # Forward pass through the model to calculate logits and loss
        logits, loss = model(xs, targets=ys)
        # Backward pass and optimization step
        loss.backward()
        optimizer.step()
        # If a learning rate scheduler is provided, adjust the learning rate
        if scheduler:
            scheduler.step()
        # Log progress every specified interval
        if epoch % config['log_interval'] == 0:
            # Calculate batch time
            batch_time = time.time() - start_time
            # Evaluate loss on the training and validation sets
            x = evaluate_loss(model)
            # Store the losses
            losses += [x]
            # Print progress logs if specified
            if print_logs:
                print(f"Epoch {epoch} | val loss {x['val']:.3f} | Time {batch_time:.3f} | ETA in seconds {batch_time * (config['epochs'] - epoch)/config['log_interval'] :.3f}")
            # Reset the timer
            start_time = time.time()
            # Print learning rate if a scheduler is provided
            if scheduler:
                print("lr: ", scheduler.get_last_lr())
    # Print the final validation loss
    print("Validation loss: ", losses[-1]['val'])
    # Plot the training and validation loss curves
    return pd.DataFrame(losses).plot()
# Execute the training process
train(model, optimizer)
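One reason the loss of SimpleBrokenModel can barely move: F.cross_entropy expects raw scores and applies log-softmax itself, so feeding it values that have already gone through softmax normalizes them twice. The plain-Python sketch below works through the arithmetic for a 65-character vocabulary; the helper names are made up for illustration.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(scores, target):
    """Per-example cross-entropy on raw scores: -log(softmax(scores)[target])."""
    return -math.log(softmax(scores)[target])

V = 65       # number of unique characters in the tiny Shakespeare text
target = 0

# Best case for the broken model: its softmax output is a perfect one-hot vector.
# Passing those probabilities where raw scores are expected applies softmax again.
perfect_probs = [1.0] + [0.0] * (V - 1)
best_broken_loss = cross_entropy_from_logits(perfect_probs, target)

# An untrained model that outputs uniform probabilities.
uniform_loss = cross_entropy_from_logits([1.0 / V] * V, target)

# With raw scores, a confident correct prediction drives the loss toward zero.
confident_scores = [20.0] + [0.0] * (V - 1)
good_loss = cross_entropy_from_logits(confident_scores, target)

print(f"double softmax, perfect prediction: {best_broken_loss:.3f}")
print(f"double softmax, uniform prediction: {uniform_loss:.3f}")
print(f"raw scores, confident prediction  : {good_loss:.6f}")
```

Because probabilities live in [0, 1], the second softmax can never separate them by much. Even a perfect prediction stays at a loss of about 3.2, close to the uniform value of ln(65) ≈ 4.17.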
# Modified SimpleModel class without softmax layer
class SimpleModel(nn.Module):
    def __init__(self, config):
        # Rest of the code
    def forward(self, idx, targets=None):
        # Embedding layer converts character indices to vectors
        x = self.embedding(idx)
        # Linear layers for modeling relationships between features
        # (raw logits: F.cross_entropy applies log-softmax internally)
        logits = self.linear(x)
        # If targets are provided, calculate and return the cross-entropy loss
        if targets is not None:
            # Rest of the code
# Create the updated SimpleModel
model = SimpleModel(MASTER_CONFIG)
# Obtain batches for training
xs, ys = get_batches(dataset, 'train', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])
# Calculate logits and loss using the model
logits, loss = model(xs, ys)
# Define the Adam optimizer for model parameters
optimizer = torch.optim.Adam(model.parameters())
# Train the model for the number of epochs set in MASTER_CONFIG
train(model, optimizer)
# Generate function for text generation using the trained model
def generate(model, config=MASTER_CONFIG, max_new_tokens=30):
    # Start 5 sequences, each from token id 0
    idx = torch.zeros(5, 1).long()
    for _ in range(max_new_tokens):
        # Call the model on at most the last context_window tokens
        logits = model(idx[:, -config['context_window']:])
        last_time_step_logits = logits[
            :, -1, :
        ]  # all 5 sequences in the batch, last time step, all the logits
        p = F.softmax(last_time_step_logits, dim=-1)  # softmax to get probabilities
        idx_next = torch.multinomial(
            p, num_samples=1
        )  # sample from the distribution to get the next token
        idx = torch.cat([idx, idx_next], dim=-1)  # append to the sequence
    return [decode(x) for x in idx.tolist()]
# Generate text using the trained model
generate(model)
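The sampling step inside generate can be illustrated without a model. This is a small sketch in plain Python, where `random.choices` plays the role of torch.multinomial; the three-character vocabulary, the scores and the seed are invented for the example.

```python
import math
import random

def softmax_probs(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

sample_vocab = ["a", "b", "c"]
last_step_scores = [2.0, 0.5, -1.0]   # hypothetical scores for the next character
probs = softmax_probs(last_step_scores)

rng = random.Random(0)
# Draw indices in proportion to probs, one draw per generated token
draws = rng.choices(range(len(sample_vocab)), weights=probs, k=1000)
counts = {ch: draws.count(i) for i, ch in enumerate(sample_vocab)}

print([round(p, 3) for p in probs])
print(counts)
```

Sampling, rather than always taking the most likely character, is what lets repeated calls produce different text.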
class RMSNorm(nn.Module):
    def __init__(self, layer_shape, eps=1e-8, bias=False):
        super(RMSNorm, self).__init__()
        # Note: eps and bias are accepted but not used in this simplified version
        # Registering a learnable parameter 'scale' as a parameter of the module
        self.register_parameter("scale", nn.Parameter(torch.ones(layer_shape)))
    def forward(self, x):
        """
        Assumes shape is (batch, seq_len, d_model)
        """
        # Calculating the Frobenius norm, RMS = 1/sqrt(N) * Frobenius norm
        ff_rms = torch.linalg.norm(x, dim=(1,2)) * x[0].numel() ** -.5
        # Normalizing the input tensor 'x' with respect to RMS
        raw = x / ff_rms.unsqueeze(-1).unsqueeze(-1)
        # Scaling the normalized tensor using the learnable parameter 'scale'
        return self.scale[:x.shape[1], :].unsqueeze(0) * raw
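The identity used in the forward pass, RMS = Frobenius norm / sqrt(N), can be checked by hand. The sketch below does so in plain Python for one toy example with 2 positions and 3 features; the numbers are arbitrary.

```python
import math

# One toy "sequence": 2 positions, 3 features each (hypothetical values)
toy_x = [[1.0, -2.0, 3.0],
         [0.5,  0.0, -1.5]]

flat = [v for row in toy_x for v in row]
n = len(flat)

frobenius = math.sqrt(sum(v * v for v in flat))        # ||x||_F over the whole slab
rms_from_norm = frobenius / math.sqrt(n)               # the form used in the class
rms_direct = math.sqrt(sum(v * v for v in flat) / n)   # sqrt(mean(x^2))

# After dividing by the RMS, the slab has an RMS of 1 (before the learned scale)
normalized = [[v / rms_from_norm for v in row] for row in toy_x]
rms_after = math.sqrt(sum(v * v for row in normalized for v in row) / n)

print(rms_from_norm, rms_direct, rms_after)
```

Note that the class takes the RMS over the whole (seq_len, d_model) slab of each example. RMSNorm as usually described normalizes each token vector over its feature dimension only, so treat this version as a simplification.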
# Define the SimpleModel_RMS with RMSNorm
class SimpleModel_RMS(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # Embedding layer to convert character indices to vectors
        self.embedding = nn.Embedding(config['vocab_size'], config['d_model'])
        # RMSNorm layer for pre-normalization
        self.rms = RMSNorm((config['context_window'], config['d_model']))
        # Linear layers for modeling relationships between features
        self.linear = nn.Sequential(
            # Rest of the code
        )
        # Print the total number of model parameters
        print("Model parameters:", sum([m.numel() for m in self.parameters()]))
    def forward(self, idx, targets=None):
        # Embedding layer converts character indices to vectors
        x = self.embedding(idx)
        # RMSNorm pre-normalization
        x = self.rms(x)
        # Linear layers for modeling relationships between features
        logits = self.linear(x)
        if targets is not None:
            # Rest of the code
# Create an instance of SimpleModel_RMS
model = SimpleModel_RMS(MASTER_CONFIG)
# Obtain batches for training
xs, ys = get_batches(dataset, 'train', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])
# Calculate logits and loss using the model
logits, loss = model(xs, ys)
# Define the Adam optimizer for model parameters
optimizer = torch.optim.Adam(model.parameters())
# Train the model
train(model, optimizer)
def get_rotary_matrix(context_window, embedding_dim):
    # Initialize a tensor for the rotary matrix with zeros
    R = torch.zeros((context_window, embedding_dim, embedding_dim), requires_grad=False)
    # Loop through each position in the context window
    for position in range(context_window):
        # Loop through each pair of dimensions in the embedding
        for i in range(embedding_dim // 2):
            # Calculate the rotation frequency (theta) for this pair of dimensions.
            # i is 0-indexed here, so it plays the role of (i - 1) in the RoPE paper's 1-indexed formula
            theta = 10000. ** (-2. * i / embedding_dim)
            # Calculate the rotated matrix elements using sine and cosine functions
            m_theta = position * theta
            R[position, 2 * i, 2 * i] = np.cos(m_theta)
            R[position, 2 * i, 2 * i + 1] = -np.sin(m_theta)
            R[position, 2 * i + 1, 2 * i] = np.sin(m_theta)
            R[position, 2 * i + 1, 2 * i + 1] = np.cos(m_theta)
    return R
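The point of rotating queries and keys is that their dot product then depends only on the distance between the two positions. The sketch below checks this in plain Python for a single pair of dimensions; the vectors, positions and the frequency `theta=1.0` are arbitrary illustrative choices.

```python
import math

def rotate(vec, position, theta=1.0):
    """Rotate a 2-D vector by the angle position * theta (one RoPE dimension pair)."""
    a = position * theta
    x0, x1 = vec
    return (x0 * math.cos(a) - x1 * math.sin(a),
            x0 * math.sin(a) + x1 * math.cos(a))

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

q_pair = (1.0, 2.0)    # hypothetical query values
k_pair = (0.5, -1.0)   # hypothetical key values

# Same offset (3 positions apart) at two different absolute positions
score_a = dot(rotate(q_pair, 5), rotate(k_pair, 2))
score_b = dot(rotate(q_pair, 13), rotate(k_pair, 10))
# A different offset (1 position apart)
score_c = dot(rotate(q_pair, 5), rotate(k_pair, 4))

print(score_a, score_b, score_c)
```

The first two scores agree and the third differs, which is how attention scores come to encode relative rather than absolute position.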
class RoPEMaskedAttentionHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # Linear transformation for query
        self.w_q = nn.Linear(config['d_model'], config['d_model'], bias=False)
        # Linear transformation for key
        self.w_k = nn.Linear(config['d_model'], config['d_model'], bias=False)
        # Linear transformation for value
        self.w_v = nn.Linear(config['d_model'], config['d_model'], bias=False)
        # Obtain rotary matrix for positional embeddings (uses get_rotary_matrix defined above)
        self.R = get_rotary_matrix(config['context_window'], config['d_model'])
    def forward(self, x, return_attn_weights=False):
        # x: input tensor of shape (batch, sequence length, dimension)
        b, m, d = x.shape  # batch size, sequence length, dimension
        # Linear transformations for Q, K, and V
        q = self.w_q(x)
        k = self.w_k(x)
        v = self.w_v(x)
        # Rotate Q and K using the RoPE matrix
        q_rotated = (torch.bmm(q.transpose(0, 1), self.R[:m])).transpose(0, 1)
        k_rotated = (torch.bmm(k.transpose(0, 1), self.R[:m])).transpose(0, 1)
        # Perform scaled dot-product attention (dropout only while training)
        activations = F.scaled_dot_product_attention(
            q_rotated, k_rotated, v, dropout_p=0.1 if self.training else 0.0, is_causal=True
        )
        if return_attn_weights:
            # Create a causal attention mask: True on and below the diagonal
            attn_mask = torch.tril(torch.ones((m, m), dtype=torch.bool), diagonal=0)
            # Calculate attention scores and set future positions to -inf before the softmax
            attn_weights = torch.bmm(q_rotated, k_rotated.transpose(1, 2)) / np.sqrt(d)
            attn_weights = attn_weights.masked_fill(~attn_mask, float('-inf'))
            attn_weights = F.softmax(attn_weights, dim=-1)
            return activations, attn_weights
        return activations
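Causal masking means position i may only attend to positions up to and including i. A common way to get this is to set the scores of future positions to -inf before the softmax so they receive exactly zero weight. The sketch below shows that in plain Python on a 3×3 score matrix with made-up values.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Future positions get -inf, so they contribute exp(-inf) = 0
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

toy_scores = [[0.2, 1.0, -0.5],
              [0.7, 0.1, 0.3],
              [-1.0, 0.4, 0.9]]
attn = causal_softmax(toy_scores)
for row in attn:
    print([round(w, 3) for w in row])
```

Every row still sums to 1, and everything above the diagonal is zero, so no position can look at characters that come after it.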
class RoPEMaskedMultiheadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # Create a list of RoPEMaskedAttentionHead instances as attention heads
        self.heads = nn.ModuleList([
            RoPEMaskedAttentionHead(config) for _ in range(config['n_heads'])
        ])
        self.linear = nn.Linear(config['n_heads'] * config['d_model'], config['d_model'])  # Linear layer after concatenating heads
        self.dropout = nn.Dropout(.1)  # Dropout layer
    def forward(self, x):
        # x: input tensor of shape (batch, sequence length, dimension)
        # Process each attention head and concatenate the results
        heads = [h(x) for h in self.heads]
        x = torch.cat(heads, dim=-1)
        # Apply linear transformation to the concatenated output
        x = self.linear(x)
        # Apply dropout
        x = self.dropout(x)
        return x
# Update the master configuration with the number of attention heads
MASTER_CONFIG.update({
    'n_heads': 8,
})
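Most of the model's parameters end up in the attention layer, and the count can be worked out by hand from the layer shapes. This is a back-of-the-envelope sketch in plain Python using d_model = 128 and n_heads = 8 from the configuration.

```python
d_model = 128   # from MASTER_CONFIG
n_heads = 8     # from MASTER_CONFIG

# Each head has three bias-free d_model x d_model projections (w_q, w_k, w_v)
per_head = 3 * d_model * d_model
all_heads = n_heads * per_head

# The output projection maps the concatenated heads back to d_model (weights + bias)
out_proj = (n_heads * d_model) * d_model + d_model

attention_params = all_heads + out_proj
print(per_head, all_heads, out_proj, attention_params)
```

The rotary matrix is a fixed tensor rather than a learned parameter, so it does not count. Each head here works at the full d_model width instead of d_model / n_heads, which is why attention dominates the parameter budget.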
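Now that we have implemented rotary embeddings (RoPE) and multi-head attention, let's rewrite our RMSNorm neural network model with the updated code. We will test its performance, compute the loss, and check the number of parameters. We will call this updated model "RopeModel".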
class RopeModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # Embedding layer for input tokens
        self.embedding = nn.Embedding(config['vocab_size'], config['d_model'])
        # RMSNorm layer for pre-normalization
        self.rms = RMSNorm((config['context_window'], config['d_model']))
        # RoPEMaskedMultiheadAttention layer
        self.rope_attention = RoPEMaskedMultiheadAttention(config)
        # Linear layer followed by ReLU activation
        self.linear = nn.Sequential(
            nn.Linear(config['d_model'], config['d_model']),
            nn.ReLU(),
        )
        # Final linear layer for prediction
        self.last_linear = nn.Linear(config['d_model'], config['vocab_size'])
        print("model params:", sum([m.numel() for m in self.parameters()]))
    def forward(self, idx, targets=None):
        # idx: input indices
        x = self.embedding(idx)
        # One block of attention
        x = self.rms(x)  # RMS pre-normalization
        x = x + self.rope_attention(x)
        x = self.rms(x)  # RMS pre-normalization
        x = x + self.linear(x)
        logits = self.last_linear(x)
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, self.config['vocab_size']), targets.view(-1))
            return logits, loss
        return logits
# Create an instance of RopeModel (RMSNorm, RoPE, Multi-Head)
model = RopeModel(MASTER_CONFIG)
# Obtain batches for training
xs, ys = get_batches(dataset, 'train', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])
# Calculate logits and loss using the model
logits, loss = model(xs, ys)
# Define the Adam optimizer for model parameters
optimizer = torch.optim.Adam(model.parameters())
# Train the model
train(model, optimizer)
Let's train the model for more epochs to see whether the loss of our re-created LLaMA LLM keeps decreasing.
# Updating training configuration with more epochs and a logging interval
MASTER_CONFIG.update({
    "epochs": 5000,
    "log_interval": 10,
})
# Training the model with the updated configuration
train(model, optimizer)
class SwiGLU(nn.Module):
    """ Paper Link -> https://arxiv.org/pdf/2002.05202v1.pdf """
    def __init__(self, size):
        super().__init__()
        self.linear_gate = nn.Linear(size, size)  # Linear transformation for the gating mechanism
        self.linear = nn.Linear(size, size)  # Linear transformation for the main branch
        # Using nn.Parameter for beta to ensure it's recognized as a learnable parameter
        self.beta = nn.Parameter(torch.ones(1))
    def forward(self, x):
        # Swish-Gated Linear Unit computation (the gate projection is computed once and reused)
        gate = self.linear_gate(x)
        swish_gate = gate * torch.sigmoid(self.beta * gate)
        out = swish_gate * self.linear(x)  # Element-wise multiplication of the gate and main branch
        return out
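To make the gating concrete, here is a scalar version of the same computation in plain Python. The weights and inputs are arbitrary numbers chosen for illustration; the real layer applies the same formula element-wise to vectors.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def swish(z, beta=1.0):
    """Swish_beta(z) = z * sigmoid(beta * z)."""
    return z * sigmoid(beta * z)

def swiglu_scalar(x, w_gate, b_gate, w_lin, b_lin, beta=1.0):
    """Scalar SwiGLU: Swish(x * w_gate + b_gate) * (x * w_lin + b_lin)."""
    return swish(x * w_gate + b_gate, beta) * (x * w_lin + b_lin)

# Swish is smooth, close to zero for large negative inputs and close to z for large positive ones
for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"swish({z:+.1f}) = {swish(z):+.4f}")

# Gate branch: 2.0 * 0.5 = 1.0, main branch: 2.0 * -1.0 + 3.0 = 1.0
y = swiglu_scalar(x=2.0, w_gate=0.5, b_gate=0.0, w_lin=-1.0, b_lin=3.0)
print(y)
```

Unlike ReLU, Swish lets small negative values through (swish(-1) is slightly below zero), and the second linear branch decides how much of the gated signal is passed on.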
class RopeModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # Embedding layer for input tokens
        self.embedding = nn.Embedding(config['vocab_size'], config['d_model'])
        # RMSNorm layer for pre-normalization
        self.rms = RMSNorm((config['context_window'], config['d_model']))
        # Multi-head attention layer with RoPE (Rotary Positional Embeddings)
        self.rope_attention = RoPEMaskedMultiheadAttention(config)
        # Linear layer followed by SwiGLU activation
        self.linear = nn.Sequential(
            nn.Linear(config['d_model'], config['d_model']),
            SwiGLU(config['d_model']),  # Adding SwiGLU activation
        )
        # Output linear layer
        self.last_linear = nn.Linear(config['d_model'], config['vocab_size'])
        # Printing total model parameters
        print("model params:", sum([m.numel() for m in self.parameters()]))
    def forward(self, idx, targets=None):
        x = self.embedding(idx)
        # One block of attention
        x = self.rms(x)  # RMS pre-normalization
        x = x + self.rope_attention(x)
        x = self.rms(x)  # RMS pre-normalization
        x = x + self.linear(x)  # Applying SwiGLU activation
        logits = self.last_linear(x)
        if targets is not None:
            # Calculate cross-entropy loss if targets are provided
            loss = F.cross_entropy(logits.view(-1, self.config['vocab_size']), targets.view(-1))
            return logits, loss
        return logits
# Create an instance of RopeModel (RMSNorm, RoPE, Multi-Head, SwiGLU)
model = RopeModel(MASTER_CONFIG)
# Obtain batches for training
xs, ys = get_batches(dataset, 'train', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])
# Calculate logits and loss using the model
logits, loss = model(xs, ys)
# Define the Adam optimizer for model parameters
optimizer = torch.optim.Adam(model.parameters())
# Train the model
train(model, optimizer)
# Update model configurations for the number of layers
MASTER_CONFIG.update({
    'n_layers': 4,  # Set the number of layers to 4
})
# add RMSNorm and residual connection
class LlamaBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # RMSNorm layer
        self.rms = RMSNorm((config['context_window'], config['d_model']))
        # RoPE Masked Multihead Attention layer
        self.attention = RoPEMaskedMultiheadAttention(config)
        # Feedforward layer with SwiGLU activation
        self.feedforward = nn.Sequential(
            nn.Linear(config['d_model'], config['d_model']),
            SwiGLU(config['d_model']),
        )
    def forward(self, x):
        # one block of attention
        x = self.rms(x)  # RMS pre-normalization
        x = x + self.attention(x)  # residual connection
        x = self.rms(x)  # RMS pre-normalization
        x = x + self.feedforward(x)  # residual connection
        return x
# Create an instance of the LlamaBlock class with the provided configuration
block = LlamaBlock(MASTER_CONFIG)
# Generate a random tensor with the specified batch size, context window, and model dimension
random_input = torch.randn(MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'], MASTER_CONFIG['d_model'])
# Apply the LlamaBlock to the random input tensor
output = block(random_input)
# OrderedDict is used to give each stacked block a name
from collections import OrderedDict
class Llama(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # Embedding layer for token representations
        self.embeddings = nn.Embedding(config['vocab_size'], config['d_model'])
        # Sequential block of LlamaBlocks based on the specified number of layers
        self.llama_blocks = nn.Sequential(
            OrderedDict([(f"llama_{i}", LlamaBlock(config)) for i in range(config['n_layers'])])
        )
        # Feedforward network (FFN) for final output
        self.ffn = nn.Sequential(
            nn.Linear(config['d_model'], config['d_model']),
            SwiGLU(config['d_model']),  # SwiGLU activation between the two linear layers
            nn.Linear(config['d_model'], config['vocab_size']),
        )
        # Print total number of parameters in the model
        print("model params:", sum([m.numel() for m in self.parameters()]))
    def forward(self, idx, targets=None):
        # Input token indices are passed through the embedding layer
        x = self.embeddings(idx)
        # Process the input through the LlamaBlocks
        x = self.llama_blocks(x)
        # Pass the processed input through the final FFN for output logits
        logits = self.ffn(x)
        # If targets are not provided, return only the logits
        if targets is None:
            return logits
        # If targets are provided, compute and return the cross-entropy loss
        loss = F.cross_entropy(logits.view(-1, self.config['vocab_size']), targets.view(-1))
        return logits, loss
# Create an instance of Llama (RMSNorm, RoPE, Multi-Head, SwiGLU, N_layers)
llama = Llama(MASTER_CONFIG)
# Obtain batches for training
xs, ys = get_batches(dataset, 'train', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])
# Calculate logits and loss using the model
logits, loss = llama(xs, ys)
# Define the Adam optimizer for model parameters
optimizer = torch.optim.Adam(llama.parameters())
# Train the model
train(llama, optimizer)
# Update the number of epochs in the configuration
MASTER_CONFIG.update({
    'epochs': 10000,
})
# Train the LLaMA model for the specified number of epochs
train(llama, optimizer, scheduler=None, config=MASTER_CONFIG)
# Training the model again; no scheduler is passed here, so the learning rate stays constant
train(llama, optimizer, config=MASTER_CONFIG)
# Generate text using the trained LLM (llama) with a maximum of 500 tokens
generated_text = generate(llama, MASTER_CONFIG, 500)[0]
# Get batches from the test set
xs, ys = get_batches(dataset, 'test', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])
# Pass the test data through the LLaMA model
logits, loss = llama(xs, ys)
# Print the loss on the test set
print(loss)
# Update configuration
MASTER_CONFIG.update({
    "epochs": 1000
})
# Create Llama model with Cosine Annealing learning schedule
llama_with_cosine = Llama(MASTER_CONFIG)
# Define Adam optimizer with specific hyperparameters
llama_optimizer = torch.optim.Adam(
    llama_with_cosine.parameters(),  # Optimize the model created just above
    betas=(.9, .95),
)
# Define Cosine Annealing learning rate scheduler
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(llama_optimizer, 300, eta_min=1e-5)
# Train the Llama model with the specified optimizer and scheduler
train(llama_with_cosine, llama_optimizer, scheduler=scheduler)
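To get a feel for what the cosine schedule does to the learning rate, the closed-form expression can be evaluated directly. The sketch below is plain Python, not a call into PyTorch; `base_lr = 1e-3` is an assumption (Adam's default learning rate), while `eta_min` and `t_max` are the values passed to CosineAnnealingLR above.

```python
import math

def cosine_annealing_lr(step, base_lr, eta_min, t_max):
    """eta_min + (base_lr - eta_min) * (1 + cos(pi * step / t_max)) / 2"""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max)) / 2

base_lr = 1e-3   # assumption: torch.optim.Adam's default learning rate
eta_min = 1e-5   # as passed to CosineAnnealingLR
t_max = 300      # as passed to CosineAnnealingLR

for step in (0, 150, 300, 450, 600):
    print(step, f"{cosine_annealing_lr(step, base_lr, eta_min, t_max):.6f}")
```

The rate falls from base_lr to eta_min over the first 300 steps. As far as I understand, the schedule is not clamped afterwards, so with 1000 epochs and T_max = 300 the learning rate climbs back up and repeats the cycle every 600 steps.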
# Save the entire model
torch.save(llama, 'llama_model.pth')
# If you want to save only the model parameters
torch.save(llama.state_dict(), 'llama_model_params.pth')
To save your PyTorch model for Hugging Face's Transformers library, you can use the save_pretrained method. Here is an example:
from transformers import GPT2LMHeadModel, GPT2Config
# Assuming Llama is your PyTorch model
llama_config = GPT2Config.from_dict(MASTER_CONFIG)
llama_transformers = GPT2LMHeadModel(config=llama_config)
# Specify the directory where you want to save the model
output_dir = "llama_model_transformers"
# Save the model and configuration
llama_transformers.save_pretrained(output_dir)
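GPT2Config creates a GPT-2-compatible configuration object, and a GPT2LMHeadModel is then built from it. Note that this snippet does not copy the weights of the Llama model trained above into the GPT2LMHeadModel, and the two architectures differ, so the weights would have to be mapped explicitly for the saved checkpoint to reproduce the trained model. Finally, save_pretrained writes the model and configuration to the specified directory.
You can then load this model with the Transformers library.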
from transformers import GPT2LMHeadModel, GPT2Config
# Specify the directory where the model was saved
output_dir = "llama_model_transformers"
# Load the model and configuration
llama_transformers = GPT2LMHeadModel.from_pretrained(output_dir)