It takes only a few minutes, in three steps.
1. Set up the environment:
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
conda create -n llama_factory python=3.10
conda activate llama_factory
pip install -r requirements.txt
pip install "bitsandbytes>=0.39.0"
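Optionally, a quick one-liner to confirm that PyTorch sees your GPU and that bitsandbytes is importable before you start training:
python -c "import torch, bitsandbytes; print(torch.cuda.is_available(), bitsandbytes.__version__)"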
2. Put your own data in data/example_dataset/examples.json, in the following format:
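For reference, here is a minimal entry in the alpaca-style layout the example dataset uses; the sample text mirrors the conversation that appears in the training log below, and the exact field set should be treated as illustrative:
[
  {
    "instruction": "Hello",
    "input": "",
    "output": "Hello, I am <NAME>, an AI assistant developed by <AUTHOR>. Nice to meet you. What can I do for you?",
    "history": []
  }
]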
3. Run the fine-tuning script:
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --template mistral \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir mixtral \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --quantization_bit 4 \
    --bf16 \
    --dataset example
The script above launches LoRA fine-tuning of the mistralai/Mixtral-8x7B-Instruct-v0.1 model.
Note: in the command above we only target the q_proj and v_proj tensors, but several other LoRA targets can be added:
[q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
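For example, to place adapters on all of these modules, you could change the flag in the command above to:
--lora_target q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj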
On screen you will see output like the following.
Here is the full log:
01/20/2024 09:51:04 - INFO - llmtuner.model.parser - Process rank: 0, device: cuda:0, n_gpu: 1
distributed training: True, compute dtype: torch.bfloat16
01/20/2024 09:51:04 - INFO - llmtuner.model.parser - Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
...
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
01/20/2024 09:51:04 - INFO - llmtuner.data.loader - Loading dataset example_dataset...
Generating train split
Generating train split: 2 examples [00:00, 87.18 examples/s]
Unable to verify splits sizes.
[INFO|configuration_utils.py:802] 2024-01-20 09:51:04,719 >> Model config MixtralConfig {
"_name_or_path": "mistralai/Mixtral-8x7B-Instruct-v0.1",
"architectures": [
"MixtralForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 32768,
"model_type": "mixtral",
"num_attention_heads": 32,
"num_experts_per_tok": 2,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"num_local_experts": 8,
"output_router_logits": false,
"rms_norm_eps": 1e-05,
"rope_theta": 1000000.0,
"router_aux_loss_coef": 0.02,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.36.2",
"use_cache": true,
"vocab_size": 32000
}
01/20/2024 09:51:04 - INFO - llmtuner.model.patcher - Quantizing model to 4 bit.
[INFO|modeling_utils.py:1341] 2024-01-20 09:51:04,827 >> Instantiating MixtralForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:826] 2024-01-20 09:51:04,828 >> Generate config GenerationConfig {
"bos_token_id": 1,
"eos_token_id": 2
}
[INFO|modeling_utils.py:3483] 2024-01-20 09:51:06,707 >> Detected 4-bit loading: activating 4-bit loading for this model
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:23<00:00, 4.40s/it]
[INFO|modeling_utils.py:4185] 2024-01-20 09:52:31,516 >> All model checkpoint weights were used when initializing MixtralForCausalLM.
[INFO|modeling_utils.py:4193] 2024-01-20 09:52:31,516 >> All the weights of MixtralForCausalLM were initialized from the model checkpoint at mistralai/Mixtral-8x7B-Instruct-v0.1.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MixtralForCausalLM for predictions without further training.
[INFO|configuration_utils.py:826] 2024-01-20 09:52:31,591 >> Generate config GenerationConfig {
"bos_token_id": 1,
"eos_token_id": 2
}
01/20/2024 09:52:32 - INFO - llmtuner.model.patcher - Gradient checkpointing enabled.
01/20/2024 09:52:32 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
01/20/2024 09:52:32 - INFO - llmtuner.model.loader - trainable params: 3407872 || all params: 46706200576 || trainable%: 0.0073
01/20/2024 09:52:32 - INFO - llmtuner.data.template - Add pad token: </s>
Running tokenizer on dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 183.49 examples/s]
input_ids:
[1, 733, 16289, 28793, 22557, 733, 28748, 16289, 28793, 22557, 28725, 315, 837, 523, 4833, 6550, 396, 16107, 13892, 6202, 486, 523, 18038, 1017, 13902, 22277, 298, 2647, 368, 28723, 1824, 541, 315, 511, 354, 368, 28804, 2]
inputs:
<s> [INST] Hello [/INST] Hello, I am <NAME>, an AI assistant developed by <AUTHOR>. Nice to meet you. What can I do for you?</s>
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, 22557, 28725, 315, 837, 523, 4833, 6550, 396, 16107, 13892, 6202, 486, 523, 18038, 1017, 13902, 22277, 298, 2647, 368, 28723, 1824, 541, 315, 511, 354, 368, 28804, 2]
labels:
Hello, I am <NAME>, an AI assistant developed by <AUTHOR>. Nice to meet you. What can I do for you?</s>
example:{'input_ids': [1, 733, 16289, 28793, 22557, 733, 28748, 16289, 28793, 22557, 28725, 315, 837, 523, 4833, 6550, 396, 16107, 13892, 6202, 486, 523, 18038, 1017, 13902, 22277, 298, 2647, 368, 28723, 1824, 541, 315, 511, 354, 368, 28804, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, 22557, 28725, 315, 837, 523, 4833, 6550, 396, 16107, 13892, 6202, 486, 523, 18038, 1017, 13902, 22277, 298, 2647, 368, 28723, 1824, 541, 315, 511, 354, 368, 28804, 2]}
[INFO|training_args.py:1838] 2024-01-20 09:52:32,684 >> PyTorch: setting up devices
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:568] 2024-01-20 09:52:32,687 >> Using auto half precision backend
[INFO|trainer.py:1706] 2024-01-20 09:52:32,965 >> ***** Running training *****
[INFO|trainer.py:1707] 2024-01-20 09:52:32,965 >> Num examples = 2
[INFO|trainer.py:1708] 2024-01-20 09:52:32,965 >> Num Epochs = 1
[INFO|trainer.py:1709] 2024-01-20 09:52:32,965 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1712] 2024-01-20 09:52:32,965 >> Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:1713] 2024-01-20 09:52:32,965 >> Gradient Accumulation steps = 8
[INFO|trainer.py:1714] 2024-01-20 09:52:32,965 >> Total optimization steps = 1
[INFO|trainer.py:1715] 2024-01-20 09:52:32,971 >> Number of trainable parameters = 3,407,872
warnings.warn(
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00, 4.55s/it][INFO|trainer.py:1947] 2024-01-20 09:52:37,530 >>
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 4.559, 'train_samples_per_second': 0.439, 'train_steps_per_second': 0.219, 'train_loss': 1.781388759613037, 'epoch': 1.0}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00, 4.56s/it]
[INFO|trainer.py:2889] 2024-01-20 09:52:37,535 >> Saving model checkpoint to mixtral
[INFO|tokenization_utils_base.py:2432] 2024-01-20 09:52:37,665 >> tokenizer config file saved in mixtral/tokenizer_config.json
[INFO|tokenization_utils_base.py:2441] 2024-01-20 09:52:37,670 >> Special tokens file saved in mixtral/special_tokens_map.json
***** train metrics *****
epoch = 1.0
train_loss = 1.7814
train_runtime = 0:00:04.55
train_samples_per_second = 0.439
train_steps_per_second = 0.219
When training finishes, the LoRA adapter is saved to the mixtral folder.
The mixtral folder contains the following files:
adapter_config.json
adapter_model.safetensors
all_results.json
special_tokens_map.json
tokenizer.model
trainer_state.json
train_results.json
tokenizer_config.json
trainer_log.jsonl
training_args.bin
README.md
Here is the mixtral/README.md file:
---
license: other
library_name: peft
tags:
- llama-factory
- lora
- generated_from_trainer
datasets:
- example_dataset
base_model: mistralai/Mixtral-8x7B-Instruct-v0.1
model-index:
- name: mixtral
results: []
---
<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->
# mixtral
This model is a fine-tuned version of [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) on the example dataset.
## Model description
More information needed
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- num_epochs: 1.0
### Training results
### Framework versions
- PEFT 0.7.1
- Transformers 4.36.2
- Pytorch 2.1.2+cu121
- Datasets 2.16.1
- Tokenizers 0.15.0
How do you run the fine-tuned model?
python src/cli_demo.py \
    --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --adapter_name_or_path mixtral \
    --template default \
    --finetuning_type lora \
    --quantization_bit=4
Note: if your GPU does not have enough VRAM, add --quantization_bit=4 to run the model with 4-bit quantization.
(Screenshot of the inference result.)
How do you run the fine-tuned model without a GPU?
You can also run the fine-tuned model with llama.cpp.
First, merge the LoRA adapter into the base model:
python src/export_model.py \
    --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --adapter_name_or_path mixtral \
    --template default \
    --finetuning_type lora \
    --export_dir mixtral-merge \
    --export_size 2 \
    --export_legacy_format False
Then, in llama.cpp, convert the merged model to GGUF format:
python convert.py mixtral-merge/
Still in llama.cpp, run the GGUF model:
./main -m mixtral-merge/ggml-model-f16.gguf -p "hello"
Model stats: params = 46.70 B, size = 86.99 GiB (16.00 BPW)
With the 16-bit model, token generation runs at about 3 tokens/s on a Linux machine without a GPU. With a single A100 40G GPU and -ngl 12 (the GPU HBM can only hold 12 layers), the speed is about 5 tokens/s.
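For reference, partial GPU offload is the same ./main invocation as above with an -ngl flag added, here set to the 12 layers that fit on the A100 40G:
./main -m mixtral-merge/ggml-model-f16.gguf -p "hello" -ngl 12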
If you quantize the 16-bit model down to 4 bits, the speed on a single A100 40G rises to about 61 tokens/s. Here is how:
./quantize mixtral-merge/ggml-model-f16.gguf mixtral-merge/ggml-model-q4_0.gguf q4_0
./main -m mixtral-merge/ggml-model-q4_0.gguf -p "hello" -ngl 48
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA-Factory
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.expert_count u32 = 8
llama_model_loader: - kv 10: llama.expert_used_count u32 = 2
llama_model_loader: - kv 11: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: general.file_type u32 = 1
llama_model_loader: - kv 14: tokenizer.ggml.model str = llama
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type f16: 930 tensors
llama_model_quantize_internal ============ Strange model: n_attention_wv = 32, n_feed_forward_w2 = 256, hparams.n_layer = 32
llama_model_quantize_internal: meta size = 780096 bytes
[ 1/ 995] token_embd.weight - [ 4096, 32000, 1, 1], type = f16, quantizing to q4_0 .. size = 250.00 MiB -> 70.31 MiB | hist: 0.036 0.016 0.025 0.039 0.056 0.076 0.096 0.110 0.120 0.110 0.096 0.076 0.057 0.039 0.025 0.021
....
[ 645/ 995] blk.20.ffn_gate.6.weight - [ 4096, 14336, 1, 1], type = f16, quantizing to q4_0 .. size = 112.00 MiB -> 31.50 MiB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.117 0.112 0.097 0.077 0.056 0.039 0.025 0.021
llama_print_timings: eval time = 3898.13 ms / 238 runs ( 16.38 ms per token, 61.05 tokens per second)
Speed on a single A100 40G (4-bit quantized Mixtral 8x7B MoE): 61.05 tokens per second.