How to fine-tune Mixtral-8x7B-Instruct on your own data

Published by alex on February 2, 2024

It takes only a few minutes and three steps:


Set up the environment


git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
conda create -n llama_factory python=3.10
conda activate llama_factory
pip install -r requirements.txt
pip install "bitsandbytes>=0.39.0"


Put your own data in data/example_dataset/examples.json, using the following format:


[Image: example of the data format]
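
As a minimal sketch of preparing such a file: the snippet below assumes the alpaca-style keys instruction, input and output. That schema is an assumption, not something the article text confirms, so compare it with the examples.json shipped in your LLaMA-Factory checkout before relying on it. The validate helper and the sample record are illustrative only.

```python
import json
from pathlib import Path

# ASSUMPTION: alpaca-style schema. Confirm against the examples.json
# that ships with your LLaMA-Factory checkout.
REQUIRED_KEYS = ("instruction", "input", "output")


def validate(records):
    """Raise ValueError if a record lacks a required key or holds a non-string value."""
    for i, rec in enumerate(records):
        for key in REQUIRED_KEYS:
            if not isinstance(rec.get(key), str):
                raise ValueError(f"record {i}: '{key}' must be a string")
    return records


# Illustrative record, modelled on the sample that appears in the training log.
records = validate([
    {
        "instruction": "Hello",
        "input": "",
        "output": "Hello, I am <NAME>, an AI assistant developed by <AUTHOR>.",
    },
])

# Written to the current directory; copy it to data/example_dataset/ afterwards.
out_path = Path("examples.json")
out_path.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")
```

Validating before writing catches a missing field early, rather than partway through model loading.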


Run the fine-tuning script:


CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --template mistral \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir mixtral \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --quantization_bit 4 \
    --bf16 \
    --dataset example


The script above starts LoRA fine-tuning of the mistralai/Mixtral-8x7B-Instruct-v0.1 model. The key options are:


  • --finetuning_type lora
  • --quantization_bit 4
  • --lora_target q_proj,v_proj
  • --output_dir mixtral


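Note: the command above targets only the q_proj and v_proj modules, but you can add other LoRA targets as well:


[q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]

Be aware that gate_proj, up_proj and down_proj are the MLP names used by LLaMA-style models. In the transformers 4.36 Mixtral implementation the expert MLP layers appear to be named w1, w2 and w3, so check the module names of your model before adding them.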
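
The number of trainable parameters reported in the training log (3,407,872) can be reproduced from the model config with a few lines of arithmetic. This is a sketch that assumes a LoRA rank of 8, which is not stated in the command and is taken here to be the tool's default. The dimensions come from the MixtralConfig printed in the log.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two matrices per target layer: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Values from the MixtralConfig in the log.
hidden_size = 4096
num_attention_heads = 32
num_key_value_heads = 8
num_hidden_layers = 32
head_dim = hidden_size // num_attention_heads  # 128

rank = 8  # ASSUMPTION: default LoRA rank, not given on the command line

q_proj = lora_param_count(hidden_size, num_attention_heads * head_dim, rank)
v_proj = lora_param_count(hidden_size, num_key_value_heads * head_dim, rank)
trainable = num_hidden_layers * (q_proj + v_proj)

all_params = 46_706_200_576  # "all params" from the log
print(trainable)                               # 3407872
print(f"{100 * trainable / all_params:.4f}%")  # 0.0073%
```

Adding more targets from the list above raises this count in the same way: each extra linear layer contributes rank × (d_in + d_out) parameters per transformer layer.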


On screen, you will see something like this:


[Image: screenshot of the training output]


Here is the full log:


01/20/2024 09:51:04 - INFO - llmtuner.model.parser - Process rank: 0, device: cuda:0, n_gpu: 1
  distributed training: True, compute dtype: torch.bfloat16
01/20/2024 09:51:04 - INFO - llmtuner.model.parser - Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
...
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
01/20/2024 09:51:04 - INFO - llmtuner.data.loader - Loading dataset example_dataset...
Generating train split
Generating train split: 2 examples [00:00, 87.18 examples/s]
Unable to verify splits sizes.
[INFO|configuration_utils.py:802] 2024-01-20 09:51:04,719 >> Model config MixtralConfig {
  "_name_or_path": "mistralai/Mixtral-8x7B-Instruct-v0.1",
  "architectures": [
    "MixtralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mixtral",
  "num_attention_heads": 32,
  "num_experts_per_tok": 2,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "num_local_experts": 8,
  "output_router_logits": false,
  "rms_norm_eps": 1e-05,
  "rope_theta": 1000000.0,
  "router_aux_loss_coef": 0.02,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.36.2",
  "use_cache": true,
  "vocab_size": 32000
}
01/20/2024 09:51:04 - INFO - llmtuner.model.patcher - Quantizing model to 4 bit.
[INFO|modeling_utils.py:1341] 2024-01-20 09:51:04,827 >> Instantiating MixtralForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:826] 2024-01-20 09:51:04,828 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2
}
[INFO|modeling_utils.py:3483] 2024-01-20 09:51:06,707 >> Detected 4-bit loading: activating 4-bit loading for this model
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:23<00:00,  4.40s/it]
[INFO|modeling_utils.py:4185] 2024-01-20 09:52:31,516 >> All model checkpoint weights were used when initializing MixtralForCausalLM.
[INFO|modeling_utils.py:4193] 2024-01-20 09:52:31,516 >> All the weights of MixtralForCausalLM were initialized from the model checkpoint at mistralai/Mixtral-8x7B-Instruct-v0.1.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MixtralForCausalLM for predictions without further training.
[INFO|configuration_utils.py:826] 2024-01-20 09:52:31,591 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2
}
01/20/2024 09:52:32 - INFO - llmtuner.model.patcher - Gradient checkpointing enabled.
01/20/2024 09:52:32 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
01/20/2024 09:52:32 - INFO - llmtuner.model.loader - trainable params: 3407872 || all params: 46706200576 || trainable%: 0.0073
01/20/2024 09:52:32 - INFO - llmtuner.data.template - Add pad token: </s>
Running tokenizer on dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 183.49 examples/s]
input_ids:
[1, 733, 16289, 28793, 22557, 733, 28748, 16289, 28793, 22557, 28725, 315, 837, 523, 4833, 6550, 396, 16107, 13892, 6202, 486, 523, 18038, 1017, 13902, 22277, 298, 2647, 368, 28723, 1824, 541, 315, 511, 354, 368, 28804, 2]
inputs:
<s> [INST] Hello [/INST] Hello, I am <NAME>, an AI assistant developed by <AUTHOR>. Nice to meet you. What can I do for you?</s>
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, 22557, 28725, 315, 837, 523, 4833, 6550, 396, 16107, 13892, 6202, 486, 523, 18038, 1017, 13902, 22277, 298, 2647, 368, 28723, 1824, 541, 315, 511, 354, 368, 28804, 2]
labels:
 Hello, I am <NAME>, an AI assistant developed by <AUTHOR>. Nice to meet you. What can I do for you?</s>
example:{'input_ids': [1, 733, 16289, 28793, 22557, 733, 28748, 16289, 28793, 22557, 28725, 315, 837, 523, 4833, 6550, 396, 16107, 13892, 6202, 486, 523, 18038, 1017, 13902, 22277, 298, 2647, 368, 28723, 1824, 541, 315, 511, 354, 368, 28804, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, 22557, 28725, 315, 837, 523, 4833, 6550, 396, 16107, 13892, 6202, 486, 523, 18038, 1017, 13902, 22277, 298, 2647, 368, 28723, 1824, 541, 315, 511, 354, 368, 28804, 2]}
[INFO|training_args.py:1838] 2024-01-20 09:52:32,684 >> PyTorch: setting up devices
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:568] 2024-01-20 09:52:32,687 >> Using auto half precision backend
[INFO|trainer.py:1706] 2024-01-20 09:52:32,965 >> ***** Running training *****
[INFO|trainer.py:1707] 2024-01-20 09:52:32,965 >>   Num examples = 2
[INFO|trainer.py:1708] 2024-01-20 09:52:32,965 >>   Num Epochs = 1
[INFO|trainer.py:1709] 2024-01-20 09:52:32,965 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:1712] 2024-01-20 09:52:32,965 >>   Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:1713] 2024-01-20 09:52:32,965 >>   Gradient Accumulation steps = 8
[INFO|trainer.py:1714] 2024-01-20 09:52:32,965 >>   Total optimization steps = 1
[INFO|trainer.py:1715] 2024-01-20 09:52:32,971 >>   Number of trainable parameters = 3,407,872
  warnings.warn(
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.55s/it][INFO|trainer.py:1947] 2024-01-20 09:52:37,530 >> 
Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 4.559, 'train_samples_per_second': 0.439, 'train_steps_per_second': 0.219, 'train_loss': 1.781388759613037, 'epoch': 1.0}                                                                                  
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.56s/it]
[INFO|trainer.py:2889] 2024-01-20 09:52:37,535 >> Saving model checkpoint to mixtral
[INFO|tokenization_utils_base.py:2432] 2024-01-20 09:52:37,665 >> tokenizer config file saved in mixtral/tokenizer_config.json
[INFO|tokenization_utils_base.py:2441] 2024-01-20 09:52:37,670 >> Special tokens file saved in mixtral/special_tokens_map.json
***** train metrics *****
  epoch                    =        1.0
  train_loss               =     1.7814
  train_runtime            = 0:00:04.55
  train_samples_per_second =      0.439
  train_steps_per_second   =      0.219
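
The input_ids and label_ids lines in the log show how the loss is restricted to the assistant's reply: every prompt token gets the label -100, which PyTorch's cross-entropy loss ignores by default. The sketch below rebuilds the logged label_ids from the logged input_ids. The prompt length of 9 is read off the log, not computed by a tokenizer.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

# input_ids copied from the log: "<s> [INST] Hello [/INST] Hello, I am ...</s>"
input_ids = [
    1, 733, 16289, 28793, 22557, 733, 28748, 16289, 28793,  # prompt (9 tokens)
    22557, 28725, 315, 837, 523, 4833, 6550, 396, 16107, 13892, 6202, 486,
    523, 18038, 1017, 13902, 22277, 298, 2647, 368, 28723, 1824, 541, 315,
    511, 354, 368, 28804, 2,                                 # response + </s>
]

prompt_len = 9  # ASSUMPTION: taken from the count of -100 entries in the log


def mask_prompt(ids, n_prompt):
    """Return labels that ignore the first n_prompt tokens and keep the rest."""
    return [IGNORE_INDEX] * n_prompt + ids[n_prompt:]


labels = mask_prompt(input_ids, prompt_len)
print(labels[:12])
```

Only the 29 response tokens, including the closing </s>, contribute to the training loss.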


When training finishes, the LoRA adapter is saved to the "mixtral" folder.


The mixtral folder contains the following files:


adapter_config.json 
adapter_model.safetensors       
all_results.json  
special_tokens_map.json  
tokenizer.model    
trainer_state.json  
train_results.json 
tokenizer_config.json    
trainer_log.jsonl  
training_args.bin
README.md


Here is the mixtral/README.md file:


---
license: other
library_name: peft
tags:
- llama-factory
- lora
- generated_from_trainer
datasets:
- example_dataset
base_model: mistralai/Mixtral-8x7B-Instruct-v0.1
model-index:
- name: mixtral
  results: []
---
<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->
# mixtral
This model is a fine-tuned version of [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) on the example dataset.
## Model description
More information needed
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- num_epochs: 1.0
### Training results
### Framework versions
- PEFT 0.7.1
- Transformers 4.36.2
- Pytorch 2.1.2+cu121
- Datasets 2.16.1
- Tokenizers 0.15.0
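
The total_train_batch_size in this README and the "Total optimization steps = 1" line in the log both follow from the batch settings. Here is a small sketch that reproduces them. The rounding rule (floor division with a minimum of one update per epoch) is an assumption about how the Trainer counts steps, chosen because it matches the logged numbers.

```python
import math


def total_train_batch_size(per_device: int, accum_steps: int, n_devices: int = 1) -> int:
    """Samples consumed per optimizer update."""
    return per_device * accum_steps * n_devices


def optimization_steps(num_examples: int, per_device: int, accum_steps: int,
                       epochs: float, n_devices: int = 1) -> int:
    """Optimizer updates for the whole run (assumed rounding rule, see lead-in)."""
    batches_per_epoch = math.ceil(num_examples / (per_device * n_devices))
    updates_per_epoch = max(batches_per_epoch // accum_steps, 1)
    return math.ceil(epochs * updates_per_epoch)


batch = total_train_batch_size(per_device=1, accum_steps=8)
steps = optimization_steps(num_examples=2, per_device=1, accum_steps=8, epochs=1.0)
print(batch, steps)  # 8 1

# Throughput figures from the log: 2 samples and 1 step in 4.559 s.
runtime_s = 4.559
print(round(2 / runtime_s, 3), round(1 / runtime_s, 3))  # 0.439 0.219
```

With only two examples the run performs a single update, so the logged loss says little about convergence. A real dataset will produce many more steps.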


How do you run the fine-tuned model?


python src/cli_demo.py --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 --adapter_name_or_path mixtral --template mistral --finetuning_type lora --quantization_bit 4


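Note: if your GPU does not have enough VRAM, keep --quantization_bit 4 (already included in the command above) to run the model with 4-bit quantization.


Below is a screenshot of the result:


[Image: screenshot of the chat demo output]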


How do you run the fine-tuned model without a GPU?


You can also run the fine-tuned model with llama.cpp:


Merge the LoRA adapter into the base model


python src/export_model.py --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 --adapter_name_or_path mixtral --template default --finetuning_type lora --export_dir mixtral-merge --export_size 2 --export_legacy_format False


In llama.cpp, convert the merged model to GGUF format


python convert.py mixtral-merge/


In llama.cpp, run the GGUF model


./main -m mixtral-merge/ggml-model-f16.gguf -p "hello"


Model stats: params = 46.70 B, size = 86.99 GiB (16.00 BPW)
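
These figures can be cross-checked against the training log. The sketch below assumes that the merged model has the base parameter count (the logged total minus the LoRA parameters) and that every weight is stored in 2 bytes. llama.cpp keeps a few small norm tensors in f32, which this estimate ignores.

```python
# From the training log: "trainable params: 3407872 || all params: 46706200576"
all_params_with_lora = 46_706_200_576
lora_params = 3_407_872

# ASSUMPTION: merging folds the adapter into existing weights, so the merged
# model has the base parameter count.
merged_params = all_params_with_lora - lora_params

bytes_per_weight = 2  # f16, i.e. 16 bits per weight
size_gib = merged_params * bytes_per_weight / 2**30

print(f"params = {merged_params / 1e9:.2f} B")  # params = 46.70 B
print(f"size   = {size_gib:.2f} GiB")           # size   = 86.99 GiB
```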


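With the 16-bit model on a Linux machine without a GPU, generation runs at about 3 tokens/s. With one A100 40G GPU it reaches ~5 tokens/s (-ngl 12), because the GPU's HBM can hold only 12 layers.


Quantizing the 16-bit model to 4 bits raises the speed on a single A100 40G to ~61 tokens/s. Here is how: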


./quantize mixtral-merge/ggml-model-f16.gguf mixtral-merge/ggml-model-q4_0.gguf q4_0
./main -m mixtral-merge/ggml-model-q4_0.gguf -p "hello" -ngl 48


llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA-Factory
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:                         llama.expert_count u32              = 8
llama_model_loader: - kv  10:                    llama.expert_used_count u32              = 2
llama_model_loader: - kv  11:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:                          general.file_type u32              = 1
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  930 tensors
llama_model_quantize_internal ============ Strange model: n_attention_wv = 32, n_feed_forward_w2 = 256, hparams.n_layer = 32
llama_model_quantize_internal: meta size = 780096 bytes
[   1/ 995]                    token_embd.weight - [ 4096, 32000,     1,     1], type =    f16, quantizing to q4_0 .. size =   250.00 MiB ->    70.31 MiB | hist: 0.036 0.016 0.025 0.039 0.056 0.076 0.096 0.110 0.120 0.110 0.096 0.076 0.057 0.039 0.025 0.021
....
[ 645/ 995]             blk.20.ffn_gate.6.weight - [ 4096, 14336,     1,     1], type =    f16, quantizing to q4_0 .. size =   112.00 MiB ->    31.50 MiB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.117 0.112 0.097 0.077 0.056 0.039 0.025 0.021 


llama_print_timings:        eval time =    3898.13 ms /   238 runs   (   16.38 ms per token,    61.05 tokens per second)
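
The per-tensor sizes in the quantization log are consistent with q4_0 storing each block of 32 weights as 16 bytes of 4-bit values plus a 2-byte f16 scale, which works out to 4.5 bits per weight. A short check of the two logged tensors, with the block layout stated as an assumption:

```python
# ASSUMPTION: q4_0 block layout of 32 weights -> 16 bytes of nibbles + 2-byte scale.
BLOCK_WEIGHTS = 32
BLOCK_BYTES = 16 + 2

MIB = 2**20


def f16_mib(rows: int, cols: int) -> float:
    """Size of a rows x cols tensor stored as f16."""
    return rows * cols * 2 / MIB


def q4_0_mib(rows: int, cols: int) -> float:
    """Size of the same tensor stored as q4_0 blocks."""
    return rows * cols // BLOCK_WEIGHTS * BLOCK_BYTES / MIB


print(BLOCK_BYTES * 8 / BLOCK_WEIGHTS)                   # 4.5 bits per weight
print(f16_mib(4096, 32000), q4_0_mib(4096, 32000))       # token_embd: 250.0 70.3125
print(f16_mib(4096, 14336), q4_0_mib(4096, 14336))       # ffn_gate:   112.0 31.5
```

Scaling the 86.99 GiB f16 model by 4.5/16 gives roughly 24.5 GiB. This is only a rough estimate, since some tensors may be kept at higher precision, but it suggests why all layers can be offloaded to a 40G GPU after quantization.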


Speed on a single A100 40G (4-bit quantized Mixtral 8×7B MoE): 61.05 tokens per second

Source: https://blog.gopenai.com/how-to-fine-tune-mixtral-8x7b-instruct-on-your-own-data-78f3b2f8c808