2024-09-11
Deep Learning

Contents

Freeze Method Log:
LoRA Method Log:
Deepspeed:

Shared issue:

https://github.com/hiyouga/LLaMA-Factory/issues/5398

This issue shows how to use DeepSpeed ZeRO-3.

Freeze Method Log:

```bash
deepspeed --include localhost:0,1,2,3 --master_port=19915 src/train.py \
    --deepspeed ./mine/ds_z3_config.json \
    --stage pt \
    --model_name_or_path /tmp/Qwen2-72B-Instruct \
    --do_train \
    --dataset xxx \
    --template qwen \
    --finetuning_type freeze \
    --freeze_trainable_layers 3 \
    --freeze_trainable_modules all \
    --output_dir xxx \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 64 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 100 \
    --learning_rate 1e-5 \
    --num_train_epochs 6.0 \
    --plot_loss \
    --preprocessing_num_workers 48 \
    --bf16 \
    --tokenized_path xxx \
    --cutoff_len 8000 \
    --ddp_timeout 18000 \
    --save_total_limit 5
```
  1. Set trainable layers: 77,78,79
  2. trainable params: 2633054208 || all params: 72706203648 || trainable%: 3.6215
  3. epochs:10
  4. train_runtime = 20:36:14.47
  5. train_samples_per_second = 1.051
  6. train_steps_per_second = 0.008
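As a sanity check, the reported trainable-parameter count matches unfreezing the last 3 of Qwen2-72B's 80 decoder layers. A minimal sketch (the dimensions below are assumptions taken from the public Qwen2-72B config: hidden size 8192, GQA key/value dim 8 heads × 128 = 1024, MLP intermediate size 29568):

```python
# Per-layer parameter count for one Qwen2-72B decoder layer.
# Dims are assumed from the public Qwen2-72B config:
# hidden=8192, kv_dim=num_kv_heads*head_dim=8*128=1024, intermediate=29568.
hidden, kv, inter = 8192, 1024, 29568

attn = hidden * hidden * 2          # q_proj + o_proj weights
attn += hidden * kv * 2             # k_proj + v_proj weights (GQA)
attn += hidden + kv + kv            # q/k/v biases (o_proj has no bias)
mlp = hidden * inter * 3            # gate_proj, up_proj, down_proj
norms = hidden * 2                  # input / post-attention RMSNorm

per_layer = attn + mlp + norms
trainable = per_layer * 3           # layers 77, 78, 79 unfrozen
print(trainable)                                  # 2633054208, matching the log
print(round(trainable / 72706203648 * 100, 4))    # 3.6215 %, matching the log
```

This reproduces the logged `trainable params: 2633054208 || trainable%: 3.6215` exactly.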

LoRA Method Log:

```bash
deepspeed --include localhost:0,1,2,3,4,5,6,7 --master_port=25091 src/train.py \
    --deepspeed ./mine/ds_z3_config.json \
    --stage pt \
    --model_name_or_path /tmp/Qwen2-72B-Instruct \
    --do_train \
    --dataset xxx \
    --template qwen \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir xxx \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 100 \
    --learning_rate 1e-4 \
    --num_train_epochs 6.0 \
    --plot_loss \
    --preprocessing_num_workers 48 \
    --bf16 \
    --tokenized_path xxx \
    --cutoff_len 8000 \
    --ddp_timeout 18000 \
    --save_total_limit 10 \
    --save_only_model
```
  1. trainable params: 16384000 || all params: 72722587648 || trainable%: 0.0225
  2. epochs:6
  3. train_runtime = 1 day, 8:57:44.99
  4. train_samples_per_second = 0.394
  5. train_steps_per_second = 0.002
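The LoRA count also checks out. With `--lora_target q_proj,v_proj` and a LoRA rank of 8 (LLaMA-Factory's default when `--lora_rank` is not set), adapters on all 80 layers give exactly the 16,384,000 trainable parameters in the log. A sketch, again assuming Qwen2-72B dimensions (hidden 8192, GQA key/value dim 1024):

```python
# LoRA adds two low-rank matrices per target module: A (r x in) and B (out x r),
# so each module contributes r * (in + out) trainable parameters.
hidden, kv, layers, r = 8192, 1024, 80, 8  # r=8 assumed (LLaMA-Factory default)

q_proj = r * (hidden + hidden)   # in=8192, out=8192
v_proj = r * (hidden + kv)       # in=8192, out=1024 (GQA)
trainable = (q_proj + v_proj) * layers
print(trainable)                 # 16384000, matching the log
```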

Deepspeed:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```
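With the `"auto"` values above, DeepSpeed resolves the batch-size fields from the training arguments: the global batch size is micro batch × gradient accumulation steps × number of GPUs. For the two runs in this post, both resolve to the same global batch:

```python
def global_batch(micro, accum, gpus):
    # What DeepSpeed resolves "train_batch_size": "auto" to.
    return micro * accum * gpus

print(global_batch(1, 64, 4))  # freeze run (4 GPUs, accum 64): 256
print(global_batch(1, 32, 8))  # LoRA run (8 GPUs, accum 32):   256
```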

LoRA requires more activation recomputation than freeze tuning, which helps explain its lower throughput above; see https://arxiv.org/pdf/2403.13372


Author: Dong


License: Unless otherwise noted, all posts on this blog are licensed under CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International). You may freely reproduce and adapt them for non-commercial use, provided you credit the author and link to the original.