Accelerating LLaMA-Factory with Megatron
2026-01-04
LLaMA-Factory

Contents

1. What is Dockerfile.megatron?
2. Is Megatron training faster?
3. How do you enable Megatron training?
Method 1: set an environment variable
Notes
4. Building the image
5. Training

1. What is Dockerfile.megatron?

Dockerfile.megatron is the Dockerfile used to build a Docker image that supports Megatron-Core training. Its main job is installing the Megatron stack (a sketch of the corresponding pip commands follows the list):

  • megatron-core==0.13.0: the Megatron core library
  • transformer-engine[pytorch]==2.2.0: the Transformer optimization engine
  • deepspeed==0.16.4: the DeepSpeed training framework
  • mcore_adapter (line 65 of the Dockerfile): the Megatron Core Adapter
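
A minimal sketch of those install steps, assuming the pinned versions above (this is illustrative, not the verbatim Dockerfile):

```bash
# Sketch of the package installs Dockerfile.megatron performs (illustrative, not verbatim)
pip install --no-cache-dir \
    megatron-core==0.13.0 \
    "transformer-engine[pytorch]==2.2.0" \
    deepspeed==0.16.4
# mcore_adapter comes from the alibaba/roll repository (see the Notes section below)
pip install "git+https://github.com/alibaba/roll.git#subdirectory=mcore_adapter"
```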

2. Is Megatron training faster?

Usually, yes. Megatron provides several kinds of optimization:

  • Model parallelism: tensor model parallelism, pipeline model parallelism, sequence parallelism
  • Fused kernels: bias_activation_fusion, apply_rope_fusion
  • Distributed optimizer: use_distributed_optimizer
  • Communication overlap: overlap_param_gather, overlap_grad_reduce
  • Others: MoE optimizations (e.g., moe_grouped_gemm)

Example configuration:

`examples/megatron/qwen3_moe_full.yaml` (lines 23-35):

```yaml
# mcore speed up
tensor_model_parallel_size: 1
sequence_parallel: false
pipeline_model_parallel_size: 4
bias_activation_fusion: true
apply_rope_fusion: true
use_distributed_optimizer: true
overlap_param_gather: true
overlap_grad_reduce: true
moe_grouped_gemm: true
moe_token_dispatcher_type: alltoall
expert_model_parallel_size: 2
recompute_granularity: full
```
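
One thing worth checking before launching: Megatron derives the data-parallel size from world_size / (TP × PP), so the GPU count must be divisible by that product (and, roughly speaking, expert_model_parallel_size must further divide the resulting data-parallel size). A quick back-of-the-envelope check, written by me rather than taken from LLaMA-Factory:

```bash
# Sanity-check parallel sizes against the available GPU count (my own helper,
# not part of LLaMA-Factory). Values mirror the config above.
TP=1; PP=4; GPUS=8
if (( GPUS % (TP * PP) == 0 )); then
    echo "data_parallel_size = $(( GPUS / (TP * PP) ))"   # -> 2 on an 8-GPU node
else
    echo "invalid: ${GPUS} GPUs is not divisible by TP*PP = $(( TP * PP ))"
fi
```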

3. How do you enable Megatron training?

Method 1: set an environment variable

Set USE_MCA=1 to enable the Megatron Core Adapter:

```bash
export USE_MCA=1
llamafactory-cli train examples/megatron/qwen3_moe_full.yaml
```
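
For multi-GPU runs, this combines with FORCE_TORCHRUN=1 (which launches through torchrun and is used later in this post), so both flags can also be set inline on a single command:

```bash
# Inline form: USE_MCA enables the Megatron path, FORCE_TORCHRUN launches via torchrun
FORCE_TORCHRUN=1 USE_MCA=1 llamafactory-cli train examples/megatron/qwen3_moe_full.yaml
```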

The corresponding check in the code:

`src/llamafactory/hparams/training_args.py` (lines 26-36):

```python
if is_env_enabled("USE_MCA"):
    if not is_mcore_adapter_available():
        raise ImportError(
            "mcore_adapter is required when USE_MCA=1. Please install `mcore_adapter` and its dependencies."
        )
    from mcore_adapter import Seq2SeqTrainingArguments as McaSeq2SeqTrainingArguments

    BaseTrainingArguments = McaSeq2SeqTrainingArguments
else:
    BaseTrainingArguments = Seq2SeqTrainingArguments
```
The relevant configuration values are:

  • finetuning_type: full  # Megatron currently only supports full fine-tuning
  • use_mca: true  # this flag is driven by the USE_MCA environment variable

Notes

  1. mcore_adapter must be installed (a quick import check follows these notes):

     ```bash
     pip install "git+https://github.com/alibaba/roll.git#subdirectory=mcore_adapter"
     ```
  2. Currently only finetuning_type: full (full fine-tuning) is supported; LoRA is not.

  3. Supported training stages: pt (pre-training), sft (supervised fine-tuning), and dpo (direct preference optimization).

  4. The dispatch logic at the training entry point:

`src/llamafactory/train/tuner.py` (lines 69-83):

```python
if finetuning_args.stage in ["pt", "sft", "dpo"] and finetuning_args.use_mca:
    if not is_mcore_adapter_available():
        raise ImportError("mcore_adapter is not installed. Please install it with `pip install mcore-adapter`.")
    if finetuning_args.stage == "pt":
        from .mca import run_pt as run_pt_mca

        run_pt_mca(model_args, data_args, training_args, finetuning_args, callbacks)
    elif finetuning_args.stage == "sft":
        from .mca import run_sft as run_sft_mca

        run_sft_mca(model_args, data_args, training_args, finetuning_args, callbacks)
    elif finetuning_args.stage == "dpo":
        from .mca import run_dpo as run_dpo_mca

        run_dpo_mca(model_args, data_args, training_args, finetuning_args, callbacks)
```
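
A quick way to confirm the adapter is importable before kicking off a long run (a generic check, not taken from the repo):

```bash
# Fails fast if mcore_adapter is missing, instead of erroring mid-launch
python -c "import mcore_adapter; print('mcore_adapter OK')"
```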

In short: set the USE_MCA=1 environment variable, use a Docker image with Megatron support (or install mcore_adapter yourself), and put the Megatron-related parameters in your config file.

4. Building the image

Script for building the apex wheel:

```bash
#!/bin/bash
# Script for building the apex wheel

# Create the output directory
mkdir -p /workspace/wheels

# Configure the pip mirror
export PIP_INDEX=${PIP_INDEX:-https://mirrors.aliyun.com/pypi/simple/}
export PIP_TRUSTED_HOST=${PIP_TRUSTED_HOST:-mirrors.aliyun.com}

# Install build dependencies first
pip install --no-cache-dir -i ${PIP_INDEX} --trusted-host ${PIP_TRUSTED_HOST} \
    packaging wheel setuptools pyproject-metadata

# Clone the apex repository (pin the branch/tag to match the Dockerfile)
git clone --depth 1 --branch 25.04 https://github.com/NVIDIA/apex.git /workspace/apex

# Enter the apex directory
cd /workspace/apex

# Set environment variables (kept consistent with the Dockerfile)
export MAX_JOBS=32
export NINJA_FLAGS="-j32"
export NVCC_APPEND_FLAGS="--threads 32"

# Build the wheel with `pip wheel` (without installing);
# --config-settings passes build options, matching the Dockerfile
MAX_JOBS=32 NINJA_FLAGS="-j32" NVCC_APPEND_FLAGS="--threads 32" \
    pip wheel -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
    --config-settings "--build-option=--cpp_ext --cuda_ext --parallel 32" \
    -w /workspace/wheels .

echo "apex wheel built; files are in /workspace/wheels"
ls -lh /workspace/wheels/*.whl
```
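
With the wheel built, later image builds can install it directly instead of recompiling apex from source (the path shown is illustrative; the exact filename depends on the apex and CUDA versions):

```bash
# Install the prebuilt wheel produced by the script above
pip install /workspace/wheels/apex-*.whl
```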
```bash
docker build -f ./docker/docker-cuda/Dockerfile.megatron \
    --build-arg PIP_INDEX=https://pypi.org/simple \
    --build-arg EXTRAS=metrics,swanlab,vllm \
    -t llamafactory:latest .
```

On the US machine:

```bash
docker run -it --net host \
    --gpus all \
    --ipc=host \
    --shm-size=8g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -v /data/xiedong:/data/xiedong \
    nvcr.io/nvidia/pytorch:25.04-py3 bash

docker run -it --net host \
    --gpus all \
    --ipc=host \
    --shm-size=8g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -v /data/xiedong:/data/xiedong \
    kevinchina/deeplearning:llamafactory0-9-4-base-1-megatron-1 bash

export DEBIAN_FRONTEND=noninteractive
export PIP_ROOT_USER_ACTION=ignore

pip install --upgrade pip setuptools wheel "hatchling>=1.18.0" editables \
    --trusted-host ${PYPI_TRUSTED_HOST} --index-url ${PYPI_MIRROR}

pip uninstall -y torch torchvision torch-tensorrt \
    flash_attn transformer-engine \
    cudf dask-cuda cugraph cugraph-service-server cuml raft-dask cugraph-dgl cugraph-pyg dask-cudf

pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
    --index-url https://download.pytorch.org/whl/cu124

pip uninstall -y opencv opencv-python opencv-python-headless && \
    rm -rf /usr/local/lib/python3.10/dist-packages/cv2/ && \
    pip install opencv-python-headless==4.11.0.86 --trusted-host ${PYPI_TRUSTED_HOST} --index-url ${PYPI_MIRROR}

pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.2.post1/flash_attn-2.7.2.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

apt-get update && apt-get install -y zip
apt-get install -y --no-install-recommends \
    locales \
    aria2 \
    fonts-noto-cjk \
    language-pack-zh-hans \
    zip \
    unzip \
    tree \
    vim \
    tzdata \
    apt-utils \
    htop \
    tmux \
    curl \
    wget \
    git \
    file \
    net-tools \
    libibverbs1 \
    libibverbs-dev \
    build-essential \
    ca-certificates

pip install pybind11 s3fs decord msgspec opencv-python megfile math_verify wandb swanlab
pip install "git+https://github.com/alibaba/roll.git#subdirectory=mcore_adapter"

apt-get install -y openjdk-21-jdk
# `ENV` is Dockerfile syntax; in an interactive shell use `export` instead
export JAVA_HOME=/usr/lib/jvm/java-21-openjdk-amd64

conda update -y openssl || true && \
    apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y openssh-server || \
    (DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends openssh-server openssh-client && \
    mkdir -p /etc/ssh && \
    ssh-keygen -A || true)

# wang machine
docker run -it --net host \
    --ipc=host \
    --shm-size=8g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --gpus all \
    -v /data/xiedong:/data/xiedong \
    kevinchina/deeplearning:llamafactory-megatron-base-2 bash

pip install "numpy==1.26.4" "optree>=0.13.0" "spacy==3.7.5" "weasel==0.4.1" \
    "transformer-engine[pytorch]==2.2.0" megatron-core==0.13.0 deepspeed==0.16.4 \
    --no-build-isolation

export apex_url=git+https://github.com/NVIDIA/apex.git@25.04
pip uninstall -y apex && \
    MAX_JOBS=2 NINJA_FLAGS="-j2" NVCC_APPEND_FLAGS="--threads 2" \
    pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
    --config-settings "--build-option=--cpp_ext --cuda_ext --parallel 2" ${apex_url}

export LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

# Push the image
docker commit fc3b244ee669 kevinchina/deeplearning:llamafactory-megatron-base-4
docker push kevinchina/deeplearning:llamafactory-megatron-base-4
```
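
After this pile of installs, a few spot-checks save a lot of pain later (my own checks, not part of the post's scripts):

```bash
# Verify the core stack imports cleanly and CUDA is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import megatron.core; print('megatron-core OK')"
python -c "import transformer_engine; print('transformer-engine OK')"
python -c "import apex; print('apex OK')"
```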
```bash
git clone https://github.com/hiyouga/LlamaFactory.git -b v0.9.4 --depth 1 /app
cd /app && pip install --no-cache-dir -e . --no-build-isolation
```
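
To confirm the editable install took effect (assuming the CLI's version subcommand, which recent releases provide):

```bash
llamafactory-cli version
```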
```bash
FORCE_TORCHRUN=1 llamafactory-cli train /path/to/qwen3_vl_2b_full.yaml
USE_MCA=1 llamafactory-cli train /mnt/s3fs/train-LlamaFactory/examples/megatron/qwen3_vl_2b_full.yaml
USE_MCA=1 llamafactory-cli train /data/xiedong/LlamaFactory/examples/megatron/qwen3_full.yaml
```

5. Training

Building the image above nearly killed me, and the training errors had me in tears too:

```bash
USE_MCA=1 llamafactory-cli train /data/xiedong/LlamaFactory/examples/megatron/qwen3_full.yaml
```
| Parameter | Meaning | Current value | Notes |
| --- | --- | --- | --- |
| tensor_model_parallel_size | Tensor parallel size | 1 | Shards along weight-matrix dimensions; 1 = no parallelism |
| pipeline_model_parallel_size | Pipeline parallel size | 1 | Splits layers across GPUs; requires at least that many processes |
| sequence_parallel | Sequence parallelism | false | Parallelizes attention computation along the sequence-length dimension |
| bias_activation_fusion | Bias-activation fusion | true | Fuses computation to reduce memory accesses |
| apply_rope_fusion | RoPE fusion | true | Optimizes positional-encoding computation |
| use_distributed_optimizer | Distributed optimizer | true | Shards the optimizer state |
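
Before adjusting these sizes, it helps to confirm how many GPUs the container actually sees (a generic nvidia-smi query):

```bash
# List visible GPUs with their memory; the parallel sizes above must fit this count
nvidia-smi --query-gpu=index,name,memory.total --format=csv
```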

I'll stop writing here and save the rest for the next post.
