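Distributed Training Configuration Plan

For your setup (the standard Qwen3-VL 8B model on 4 nodes with 32 GPUs), these are the configurations I recommend: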

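I. Hardware Topology Analysis

Number of nodes: 4 machines
Per-node configuration: 8 GPUs per node
Total GPUs: 32
Intra-node bandwidth: NVLink/NVSwitch (up to 900 GB/s per GPU on H100, 600 GB/s on A100)
Inter-node bandwidth: InfiniBand/RoCE (200-400 Gb/s per NIC)
Core principle: keep TP inside a node; PP and DP may span nodes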
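
The core principle above can be checked mechanically. Below is a minimal sketch, not taken from any library: `derive_layout` and its parameter names are hypothetical helpers introduced for illustration. It derives the DP degree from the world size and checks whether a TP group fits inside one node, assuming TP groups occupy consecutive ranks.

```python
def derive_layout(num_nodes: int, gpus_per_node: int, tp: int, pp: int) -> dict:
    """Derive the data-parallel degree and check TP placement (illustrative helper)."""
    world_size = num_nodes * gpus_per_node
    if world_size % (tp * pp) != 0:
        raise ValueError(f"world size {world_size} is not divisible by TP*PP = {tp * pp}")
    # Assumption: TP groups occupy consecutive ranks, so a group stays inside
    # one node only if the TP degree divides the number of GPUs per node.
    tp_inside_node = gpus_per_node % tp == 0
    return {
        "world_size": world_size,
        "dp": world_size // (tp * pp),
        "tp_inside_node": tp_inside_node,
    }


if __name__ == "__main__":
    # The three (TP, PP) combinations discussed in this plan.
    for tp, pp in [(4, 1), (8, 1), (2, 2)]:
        print(f"TP={tp}, PP={pp}:", derive_layout(4, 8, tp, pp))
```

A TP degree that does not divide 8 (or exceeds it) would make some group straddle two nodes, which is exactly what the principle tries to avoid.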

II. Recommended Configurations

🏆 Option 1: TP=4, PP=1, DP=8 (top recommendation)

Core configuration:

yaml
# Basic settings
model_name_or_path: Qwen/Qwen3-VL-8B-Instruct
do_train: true
stage: sft
finetuning_type: full
freeze_vision_tower: false
freeze_multi_modal_projector: false
freeze_language_model: false

# Data settings
dataset: your_dataset
template: qwen3_vl
cutoff_len: 4096
image_max_pixels: 451584

# Training parameters
output_dir: saves/qwen3_vl_8b_32gpu
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
num_train_epochs: 3
learning_rate: 2.0e-5               # decimal point keeps YAML 1.1 parsers from reading it as a string
lr_scheduler_type: cosine
bf16: true
logging_steps: 5
save_steps: 200

# Megatron parallelism
tensor_model_parallel_size: 4       # 4-GPU TP (inside a node)
pipeline_model_parallel_size: 1     # no PP
sequence_parallel: false            # left off here; enable if activation memory is tight

# Performance optimizations
bias_activation_fusion: true
apply_rope_fusion: true
use_distributed_optimizer: true
overlap_param_gather: true
overlap_grad_reduce: true
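Parallel topology:

Data parallel (DP) = 32 / (4 * 1) = 8 ways
- Each node hosts 2 model replicas (8 GPUs per node ÷ 4 GPUs per TP group = 2)
- The 8 replicas are spread across the 4 machines

Tensor parallel (TP) = 4
- Every 4 GPUs form one group, and each GPU holds one shard of the model weights
- Each group stays inside a single node, because 4 divides the 8 GPUs per node and Megatron assigns TP ranks contiguously by default

Global batch size = 8 (DP) × 1 (bs) × 8 (acc) = 64

Advantages:
✅ TP=4 is enough for each GPU to hold its shard of the 8B model
✅ DP=8 provides ample data parallelism
✅ No PP bubble loss (PP efficiency is typically around 85-90%)
✅ Only DP gradient synchronization crosses nodes, and it can overlap with compute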
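
To make the TP=4, DP=8 topology concrete, the sketch below enumerates the rank groups. It assumes contiguous TP rank assignment and PP=1; the function names (`tp_groups`, `dp_groups`, `node_of`) are hypothetical and not part of Megatron or LLaMA-Factory.

```python
def tp_groups(world_size: int, tp: int) -> list[list[int]]:
    """Consecutive ranks form one TP group (assumes contiguous assignment, PP=1)."""
    return [list(range(start, start + tp)) for start in range(0, world_size, tp)]


def dp_groups(world_size: int, tp: int) -> list[list[int]]:
    """Ranks holding the same weight shard form one DP group (stride = TP degree)."""
    return [list(range(offset, world_size, tp)) for offset in range(tp)]


def node_of(rank: int, gpus_per_node: int = 8) -> int:
    """Node index that hosts a given global rank."""
    return rank // gpus_per_node


if __name__ == "__main__":
    tps = tp_groups(32, 4)
    dps = dp_groups(32, 4)
    print("TP groups:", tps)
    print("first DP group:", dps[0])
    # Count how many TP groups (model replicas) land on each of the 4 nodes.
    print("replicas per node:", [sum(node_of(g[0]) == n for g in tps) for n in range(4)])
```

Each DP group has 8 members (one per replica) and spans all 4 nodes, which is why gradient synchronization is the only traffic that has to cross the inter-node fabric in this layout.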

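🥈 Option 2: TP=8, PP=1, DP=4 (more memory headroom)

yaml
# Differences from Option 1
per_device_train_batch_size: 2      # TP=8 frees memory, so the batch size can grow
gradient_accumulation_steps: 8
tensor_model_parallel_size: 8       # all 8 GPUs of a node do TP
pipeline_model_parallel_size: 1
sequence_parallel: true             # recommended at TP=8

Parallel topology:

DP = 32 / 8 = 4 (one TP group per node, 4 nodes = 4 replicas)
TP = 8 (all 8 GPUs of a node, making full use of NVLink)
Global batch size = 4 × 2 × 8 = 64

Advantages and caveats:
✅ TP=8 stays inside one node, where communication is fastest
✅ Smallest per-GPU memory footprint (the model is split 8 ways)
✅ per_device_train_batch_size can be raised to 2
⚠️ DP=4 gives less data parallelism, though the larger batch size compensates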

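🥉 Option 3: TP=2, PP=2, DP=8 (balanced)

yaml
# Differences from Option 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
tensor_model_parallel_size: 2       # smallest TP
pipeline_model_parallel_size: 2     # 2 pipeline stages
sequence_parallel: false

# PP-specific parameter
# Virtual pipeline stages shrink the bubble; check that your Megatron version accepts interleaving at PP=2
virtual_pipeline_model_parallel_size: 2

Parallel topology:

DP = 32 / (2 * 2) = 8
TP = 2 (every 2 GPUs form a group)
PP = 2 (the model is split into 2 stages)
Per node: 8 GPUs / 2 (TP) = 4 TP groups; one replica needs 2 TP groups (one per pipeline stage), i.e. 4 GPUs

Advantages and caveats:
✅ DP=8 gives high data parallelism
✅ TP=2 keeps tensor-parallel communication overhead low
⚠️ PP=2 introduces roughly 10-15% bubble loss
⚠️ More complex to configure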
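
The bubble figure quoted above can be sanity-checked with the usual pipeline-schedule estimate. This is a rough sketch under stated assumptions: the 8 gradient accumulation steps are treated as 8 microbatches per optimizer step, every microbatch takes the same time, and `bubble_fraction` is a hypothetical helper name.

```python
def bubble_fraction(pp: int, num_microbatches: int, vpp: int = 1) -> float:
    """Fraction of a training step spent idle in the pipeline bubble.

    Standard 1F1B estimate: the bubble costs (pp - 1) / vpp microbatch slots
    on top of num_microbatches slots of useful work.
    """
    if pp < 1 or num_microbatches < 1 or vpp < 1:
        raise ValueError("pp, num_microbatches and vpp must be positive")
    idle_slots = (pp - 1) / vpp
    return idle_slots / (num_microbatches + idle_slots)


if __name__ == "__main__":
    print(f"PP=2, 8 microbatches:        {bubble_fraction(2, 8):.1%}")         # 11.1%
    print(f"PP=2, 8 microbatches, VPP=2: {bubble_fraction(2, 8, vpp=2):.1%}")  # 5.9%
    print(f"PP=1 (Options 1 and 2):      {bubble_fraction(1, 8):.1%}")         # 0.0%
```

Under these assumptions, PP=2 with 8 microbatches lands inside the 10-15% range, and raising `gradient_accumulation_steps` shrinks the bubble further because more microbatches amortize the same idle time.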

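III. Key Parameter Tuning

1. sequence_parallel settings

yaml
# Sequence parallelism only takes effect when TP > 1. Enable it at TP >= 4 to save activation memory
# (Option 1 leaves it off at cutoff_len 4096; turn it on there if activation memory gets tight)
tensor_model_parallel_size: 4
sequence_parallel: true  # shards activations along the sequence dimension, lowering activation memory

# Not recommended when TP < 4
tensor_model_parallel_size: 2
sequence_parallel: false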
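
To see how much activation memory sequence parallelism can save, here is a back-of-the-envelope sketch based on the commonly used per-layer estimate of 34·s·b·h bytes for 16-bit training. Several things here are assumptions: the attention-score term is dropped (as with a fused or flash attention kernel), the coefficients come from a GPT-style layer rather than Qwen's exact architecture, and the hidden size of 4096 in the example is assumed rather than read from the model config. Treat the output as order-of-magnitude only.

```python
def activation_bytes_per_layer(seq_len: int, micro_batch: int, hidden: int,
                               tp: int = 1, sequence_parallel: bool = False) -> float:
    """Approximate activation bytes kept per transformer layer.

    Of the 34 units of s*b*h, 24 sit inside the TP-sharded attention and MLP
    blocks; the remaining 10 (layer norms, dropouts, block inputs) are
    replicated on every TP rank unless sequence parallelism shards them too.
    """
    sbh = seq_len * micro_batch * hidden
    if tp == 1:
        return 34 * sbh
    if sequence_parallel:
        return 34 * sbh / tp
    return (10 + 24 / tp) * sbh


if __name__ == "__main__":
    gib = 2 ** 30
    for tp, sp in [(1, False), (4, False), (4, True), (8, True)]:
        size = activation_bytes_per_layer(4096, 1, 4096, tp=tp, sequence_parallel=sp)
        print(f"TP={tp}, sequence_parallel={sp}: {size / gib:.3f} GiB per layer")
```

In this model, turning on sequence parallelism at TP=4 roughly halves the per-layer activation footprint compared with TP=4 alone, because the replicated part also gets divided by the TP degree.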

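2. recompute_granularity settings

bash
# Estimate memory usage for Qwen3-VL 8B
python3 -c "
model_params = 8e9  # 8B parameters
bytes_per_param = 2  # bf16

# Model weights
model_memory = model_params * bytes_per_param / 1e9  # GB

# Optimizer state (Adam: 8 bytes/param for fp32 momentum + variance;
# fp32 master weights, if kept, add another 4 bytes/param and are not counted here)
optimizer_memory = model_params * 8 / 1e9

# Gradients
gradient_memory = model_params * bytes_per_param / 1e9

# Activations (rough estimate)
activation_memory_per_layer = 0.5  # GB
num_layers = 32  # assumed; check num_hidden_layers in config.json
activation_memory = activation_memory_per_layer * num_layers

print('=== Qwen3-VL 8B single-GPU memory estimate (no parallelism) ===')
print(f'Model weights: {model_memory:.1f} GB')
print(f'Optimizer state: {optimizer_memory:.1f} GB')
print(f'Gradients: {gradient_memory:.1f} GB')
print(f'Activations: {activation_memory:.1f} GB (bs=1, len=4096)')
print(f'Total: {model_memory + optimizer_memory + gradient_memory + activation_memory:.1f} GB')
print()

# Memory at TP=4. Optimizer state is divided by TP only, so this is an upper bound:
# the distributed optimizer additionally shards it across the DP ranks.
print('=== Per-GPU memory at TP=4 (upper bound with distributed optimizer) ===')
tp = 4
print(f'Model weights: {model_memory/tp:.1f} GB')
print(f'Optimizer state: {optimizer_memory/tp:.1f} GB (split by TP only)')
print(f'Gradients: {gradient_memory/tp:.1f} GB')
print(f'Activations: {activation_memory:.1f} GB (activations not split)')
print(f'Total: {(model_memory + optimizer_memory + gradient_memory)/tp + activation_memory:.1f} GB')
"

Estimated memory:

=== Qwen3-VL 8B single-GPU memory estimate (no parallelism) ===
Model weights: 16.0 GB
Optimizer state: 64.0 GB
Gradients: 16.0 GB
Activations: 16.0 GB (bs=1, len=4096)
Total: 112.0 GB

=== Per-GPU memory at TP=4 (upper bound with distributed optimizer) ===
Model weights: 4.0 GB
Optimizer state: 16.0 GB (split by TP only)
Gradients: 4.0 GB
Activations: 16.0 GB (activations not split)
Total: 40.0 GB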
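
The estimate above divides the optimizer state by TP only. With a distributed optimizer, that state is also sharded across the data-parallel ranks, so the real per-GPU figure should be lower. The sketch below repeats the same arithmetic with that extra division. `per_gpu_memory_gb` is a hypothetical helper, the 16 GB activation figure is carried over unchanged from the rough estimate, and passing `optimizer_bytes_per_param=12` is one way to also count fp32 master weights.

```python
def per_gpu_memory_gb(params: float = 8e9, tp: int = 4, dp: int = 8,
                      optimizer_bytes_per_param: int = 8,
                      activation_gb: float = 16.0,
                      shard_optimizer_over_dp: bool = True) -> dict:
    """Per-GPU memory estimate in GB, mirroring the shell estimate above."""
    weights = params * 2 / 1e9 / tp   # bf16 weights, split by TP
    grads = params * 2 / 1e9 / tp     # bf16 gradients, split by TP
    optimizer = params * optimizer_bytes_per_param / 1e9 / tp
    if shard_optimizer_over_dp:
        optimizer /= dp               # distributed optimizer: state sharded over DP ranks
    total = weights + grads + optimizer + activation_gb
    return {"weights": weights, "grads": grads, "optimizer": optimizer,
            "activations": activation_gb, "total": total}


if __name__ == "__main__":
    print("TP only:        ", per_gpu_memory_gb(shard_optimizer_over_dp=False))
    print("TP + DP sharded:", per_gpu_memory_gb())
    print("with fp32 master weights:", per_gpu_memory_gb(optimizer_bytes_per_param=12))
```

Under these assumptions the 40 GB figure is a conservative ceiling; activations then dominate, which is why the recompute settings discussed in this section matter most on smaller GPUs.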

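Choose by GPU model:

yaml
# A100/H100 80GB: plenty of memory
recompute_granularity: null         # or leave unset
# Expected memory: ~40GB (TP=4) or ~45GB (TP=8)

# A100 40GB: memory is tight
recompute_granularity: selective    # saves roughly 30% of activation memory
# Expected memory: ~30GB (TP=4)

# V100 32GB: memory is very tight
recompute_granularity: full         # saves roughly 50% of activation memory
per_device_train_batch_size: 1      # and bs must stay at 1
# Expected memory: ~25GB (TP=4)
# Note: V100 has no native bf16 support, so replace bf16: true with fp16: true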

3. gradient_accumulation_steps tuning

yaml
# Target: global batch size = 64-128 (rule of thumb)
# The values below are alternatives; keep only one per config file

# TP=4, DP=8
gradient_accumulation_steps: 8      # global bs = 8*1*8 = 64
# or
gradient_accumulation_steps: 16     # global bs = 8*1*16 = 128

# TP=8, DP=4
gradient_accumulation_steps: 8      # global bs = 4*2*8 = 64
per_device_train_batch_size: 2      # TP=8 leaves more free memory
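
The same arithmetic can be inverted to pick the accumulation steps for a target global batch size. A minimal sketch, with `accumulation_steps` as a hypothetical helper name:

```python
def accumulation_steps(target_global_batch: int, dp: int, per_device_batch: int) -> int:
    """Gradient accumulation steps needed to reach a target global batch size.

    global batch = DP * per-device batch * accumulation steps
    """
    samples_per_micro_step = dp * per_device_batch
    if target_global_batch % samples_per_micro_step != 0:
        raise ValueError(
            f"target {target_global_batch} is not a multiple of "
            f"DP * per-device batch = {samples_per_micro_step}"
        )
    return target_global_batch // samples_per_micro_step


if __name__ == "__main__":
    print(accumulation_steps(64, dp=8, per_device_batch=1))   # TP=4 option
    print(accumulation_steps(128, dp=8, per_device_batch=1))  # TP=4 option, larger batch
    print(accumulation_steps(64, dp=4, per_device_batch=2))   # TP=8 option
```

Note that TP and PP do not appear in the formula: only the number of data-parallel replicas multiplies the batch.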

IV. Complete Recommended Configuration (TP=4 option)

yaml
# === Model and data ===
model_name_or_path: Qwen/Qwen3-VL-8B-Instruct
dataset_dir: /path/to/your/data
dataset: your_dataset
template: qwen3_vl
cutoff_len: 4096
image_max_pixels: 451584
video_max_pixels: 16384
trust_remote_code: true

# === Training setup ===
do_train: true
stage: sft
finetuning_type: full

# Which multimodal components to train
freeze_vision_tower: false              # train the vision encoder
freeze_multi_modal_projector: false     # train the projector
freeze_language_model: false            # train the language model

# === Hyperparameters ===
output_dir: saves/qwen3_vl_8b_4m32g
per_device_train_batch_size: 1          # per-GPU batch size
gradient_accumulation_steps: 8          # accumulation steps
num_train_epochs: 3
learning_rate: 2.0e-5                   # decimal point keeps YAML 1.1 parsers from reading it as a string
lr_scheduler_type: cosine
warmup_ratio: 0.03
weight_decay: 0.01
bf16: true

# === Logging and checkpoints ===
logging_steps: 5
save_steps: 200
save_strategy: steps
save_total_limit: 3
# "no" is quoted because a bare no parses as boolean false in YAML 1.1;
# newer transformers releases name this key eval_strategy
evaluation_strategy: "no"               # or steps
load_best_model_at_end: false
report_to: tensorboard

# === Data loading ===
preprocessing_num_workers: 16           # multi-process preprocessing for the 4-node, 32-GPU job
dataloader_num_workers: 4
dataloader_pin_memory: true
remove_unused_columns: false

# === Distributed settings ===
ddp_timeout: 180000000                  # timeout in seconds for cross-node communication
ddp_find_unused_parameters: false

# ===== Megatron-Core parallelism =====
# DP=8, TP=4, PP=1, 32 GPUs in total
tensor_model_parallel_size: 4           # 4-way tensor parallelism
pipeline_model_parallel_size: 1         # no pipeline parallelism
sequence_parallel: false                # left off at cutoff_len 4096; enable if activation memory is tight

# === Megatron performance optimizations ===
bias_activation_fusion: true            # fuse bias and activation
apply_rope_fusion: true                 # fused RoPE computation
use_distributed_optimizer: true         # distributed optimizer (required for overlap_param_gather)
overlap_param_gather: true              # overlap parameter all-gather with compute (important)
overlap_grad_reduce: true               # overlap gradient reduction with compute (important)

# === Memory optimization (choose by GPU model) ===
# A100/H100 80GB: no recompute needed
# recompute_granularity: null

# A100 40GB: use selective
# recompute_granularity: selective

# V100 32GB: use full
# recompute_granularity: full
# per_device_train_batch_size: 1
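
Before launching, it can help to check that the parallel sizes divide the model dimensions, since tensor parallelism typically splits attention heads and pipeline parallelism splits layers. The sketch below is a hypothetical pre-flight check, not a Megatron or LLaMA-Factory API, and the exact rules depend on the Megatron version. The numbers in the example call (32 layers, 32 attention heads, 8 KV heads) are placeholders; read the real values from the model's config.json.

```python
def check_parallel_config(num_layers: int, num_attention_heads: int, num_kv_heads: int,
                          tp: int, pp: int, vpp: int = 1) -> list[str]:
    """Return a list of problems; an empty list means the sizes divide cleanly."""
    problems = []
    if num_attention_heads % tp != 0:
        problems.append(f"{num_attention_heads} attention heads not divisible by TP={tp}")
    if num_kv_heads % tp != 0:
        problems.append(f"{num_kv_heads} KV heads not divisible by TP={tp}")
    if num_layers % (pp * vpp) != 0:
        problems.append(f"{num_layers} layers not divisible by PP*VPP={pp * vpp}")
    return problems


if __name__ == "__main__":
    # Placeholder model dimensions (assumed, not read from the real config).
    dims = dict(num_layers=32, num_attention_heads=32, num_kv_heads=8)
    for name, tp, pp, vpp in [("Option 1", 4, 1, 1), ("Option 2", 8, 1, 1), ("Option 3", 2, 2, 2)]:
        issues = check_parallel_config(tp=tp, pp=pp, vpp=vpp, **dims)
        print(name, "OK" if not issues else issues)
```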

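V. Launch Commands

1. Set environment variables

bash
export USE_MCA=1
export CUDA_DEVICE_MAX_CONNECTIONS=1  # recommended by Megatron
export NCCL_IB_DISABLE=0              # keep InfiniBand enabled
export NCCL_NET_GDR_LEVEL=5           # GPU Direct RDMA
export NCCL_SOCKET_IFNAME=eth0        # change to your actual network interface name

2. Distributed launch

bash
# Save the complete configuration from Section IV as examples/megatron/qwen3_vl_32gpu.yaml on every node.
# On the master node (node_rank=0):
torchrun \
    --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=0 \
    --master_addr="MASTER_NODE_IP" \
    --master_port=29500 \
    $(which llamafactory-cli) train examples/megatron/qwen3_vl_32gpu.yaml

# On each of the other 3 nodes, run the same command with that node's rank:
# node 1: node_rank=1
# node 2: node_rank=2
# node 3: node_rank=3
# (the command below is for node 1; change --node_rank on nodes 2 and 3)
torchrun \
    --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=1 \
    --master_addr="MASTER_NODE_IP" \
    --master_port=29500 \
    $(which llamafactory-cli) train examples/megatron/qwen3_vl_32gpu.yaml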
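
Since the four commands differ only in `--node_rank`, they can be generated rather than edited by hand. The sketch below only assembles command strings using the same torchrun flags shown above; `torchrun_command` is a hypothetical helper and `10.0.0.1` is a placeholder master address.

```python
import shlex


def torchrun_command(node_rank: int, master_addr: str, config_path: str,
                     nnodes: int = 4, nproc_per_node: int = 8,
                     master_port: int = 29500) -> str:
    """Build the launch command for one node as a shell string."""
    args = [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
    ]
    quoted = " ".join(shlex.quote(a) for a in args)
    # $(which llamafactory-cli) is left unquoted so the shell expands it on the node.
    return f"{quoted} $(which llamafactory-cli) train {shlex.quote(config_path)}"


if __name__ == "__main__":
    config = "examples/megatron/qwen3_vl_32gpu.yaml"
    for rank in range(4):
        print(f"# node {rank}")
        print(torchrun_command(rank, "10.0.0.1", config))
```

The printed lines can be pasted into each node's shell, or handed to whatever job scheduler you use, after the environment variables from step 1 are exported.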

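VI. Expected Performance

TP=4, DP=8 configuration:

Estimated throughput: 3000-4000 tokens/s/GPU
Training time (100k samples): about 2-3 hours
Memory usage: ~40GB/GPU (comfortable on an A100 80GB)
Compared with a single 8-GPU node:
- 4x data-parallel speedup (ideal case)
- Realistic speedup: 3.2-3.6x (accounting for cross-node communication)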
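
The training-time estimate follows from simple arithmetic on throughput. The sketch below shows the calculation; `training_hours` is a hypothetical helper, and the example assumes 3 epochs with every sample filling the full 4096-token `cutoff_len`, which is an upper bound since real samples are usually shorter.

```python
def training_hours(num_samples: int, epochs: int, avg_tokens_per_sample: float,
                   tokens_per_sec_per_gpu: float, num_gpus: int) -> float:
    """Wall-clock hours needed to process the dataset at a given throughput."""
    total_tokens = num_samples * epochs * avg_tokens_per_sample
    cluster_tokens_per_sec = tokens_per_sec_per_gpu * num_gpus
    return total_tokens / cluster_tokens_per_sec / 3600


if __name__ == "__main__":
    for throughput in (3000, 4000):
        hours = training_hours(100_000, 3, 4096, throughput, 32)
        print(f"{throughput} tokens/s/GPU -> {hours:.2f} h")
```

With full-length samples this gives roughly 2.7 to 3.6 hours, so the 2-3 hour figure implicitly assumes an average sample somewhat shorter than the cutoff.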
