LLaMA Factory 训练数据的图片token插入位置

1. 训练阶段：图片 token 插入位置

是的，图片通过编码器编码为 token 后，插入的位置就是 <image> 标记的位置。

具体实现机制：

Qwen2VL 的处理流程：

占位符替换：在 Qwen2VLPlugin.process_messages 中（第 1630-1640 行）：

python
展开代码
while IMAGE_PLACEHOLDER in content:
    image_seqlen = image_grid_thw[num_image_tokens].prod() // merge_length if self.expand_mm_tokens else 1
    content = content.replace(
        IMAGE_PLACEHOLDER, f"<|vision_start|>{self.image_token * image_seqlen}<|vision_end|>", 1
    )
    num_image_tokens += 1

Token 替换：在 Qwen2VLProcessor.__call__ 中（第 150-170 行）：

python
展开代码
while self.image_token in text[i]:
    num_image_tokens = image_grid_thw[index].prod() // merge_length
    text[i] = text[i].replace(self.image_token, "<|placeholder|>" * num_image_tokens, 1)
    index += 1
text[i] = text[i].replace("<|placeholder|>", self.image_token)

特征插入：在 Qwen2VLModel.forward 中（第 1026-1065 行），通过 masked_scatter 将图片特征插入到对应的 token 位置。

关键点：

位置精确对应：图片 token 严格按照 <image> 标记在文本中的位置插入
顺序匹配：多个图片按 <image> 标记出现的顺序依次处理
特征替换：图片特征直接替换对应的占位符 token

2. 推理阶段：图片 token 位置

推理阶段需要注意图片 token 的位置，但不需要固定在前或后。

VLLM 部署的处理：

1. VLLM 引擎处理（`src/llamafactory/chat/vllm_engine.py` 第 101-136 行）：

python
展开代码
if images is not None and not any(IMAGE_PLACEHOLDER in message["content"] for message in messages):
    messages[0]["content"] = IMAGE_PLACEHOLDER * len(images) + messages[0]["content"]

2. 多模态数据处理（`scripts/vllm_infer.py` 第 140-171 行）：

python
展开代码
multi_modal_data = {
    "image": template_obj.mm_plugin._regularize_images(
        image, image_max_pixels=image_max_pixels, image_min_pixels=image_min_pixels
    )["images"]
}
vllm_inputs.append({"prompt_token_ids": batch["input_ids"][j], "multi_modal_data": multi_modal_data})

目录