LLaMA Factory + ShareGPT + VLM + 指令监督微调格式

详细分析：ShareGPT + VLM + 指令监督微调格式

1. 你的数据格式分析

你提供的数据格式：

json
展开代码
{
    "image": "000443568.jpg",
    "width": 700,
    "height": 700,
    "conversations": [
        {
            "from": "human",
            "value": "Write a detailed description of this image..."
        },
        {
            "from": "gpt",
            "value": "The image is of a pair of silver earrings..."
        }
    ]
}

问题分析：

缺少 <image> 标记：在 conversations 中没有 <image> 占位符
图片路径格式：image 字段是单个字符串，但需要与对话中的 <image> 标记对应
额外字段：width 和 height 字段在 LLaMA-Factory 中不会被使用

2. 正确的 ShareGPT + VLM 格式

根据代码分析，正确的格式应该是：

格式 A：标准 ShareGPT 格式（推荐）

json
展开代码
{
    "conversations": [
        {
            "from": "human",
            "value": "<image>Write a detailed description of this image, do not forget about the texts on it if they exist. Also, do not forget to mention the type / style of the image. No bullet points. When writing descriptions, prioritize clarity and direct observation over embellishment or interpretation.\nDon't forget these rules:\n1. **Be Direct and Concise**: Provide straightforward descriptions without adding interpretative or speculative elements.\n2. **Use Segmented Details**: Break down details about different elements of an image into distinct sentences, focusing on one aspect at a time.\n3. **Maintain a Descriptive Focus**: Prioritize purely visible elements of the image, avoiding conclusions or inferences.\n4. **Follow a Logical Structure**: Begin with the central figure or subject and expand outward, detailing its appearance before addressing the surrounding setting.\n5. **Avoid Juxtaposition**: Do not use comparison or contrast language; keep the description purely factual.\n6. **Incorporate Specificity**: Mention age, gender, race, and specific brands or notable features when present, and clearly identify the medium if it's discernible."
        },
        {
            "from": "gpt",
            "value": "The image is of a pair of silver earrings. Each earring features a rectangular silver frame with an oval purple gemstone in the center. The frame has an ornate design with a sunburst pattern at the top and a spike-like protrusion at the bottom. The earrings are made of silver and have a hook-style backing."
        }
    ],
    "images": [
        "000443568.jpg"
    ]
}

格式 B：OpenAI 格式

json
展开代码
{
    "messages": [
        {
            "role": "user",
            "content": "<image>Write a detailed description of this image..."
        },
        {
            "role": "assistant", 
            "content": "The image is of a pair of silver earrings..."
        }
    ],
    "images": [
        "000443568.jpg"
    ]
}

3. 对应的 dataset_info.json 配置

配置 A：标准 ShareGPT 格式

json
展开代码
{
    "my_vlm_dataset": {
        "file_name": "data.json",
        "formatting": "sharegpt",
        "columns": {
            "messages": "conversations",
            "images": "images"
        }
    }
}

配置 B：OpenAI 格式

json
展开代码
{
    "my_vlm_dataset": {
        "file_name": "data.json",
        "formatting": "sharegpt",
        "columns": {
            "messages": "messages",
            "images": "images"
        },
        "tags": {
            "role_tag": "role",
            "content_tag": "content",
            "user_tag": "user",
            "assistant_tag": "assistant"
        }
    }
}

4. 关键要求说明

4.1 图片标记要求

必须使用 <image> 标记：在对话文本中必须包含 <image> 占位符
数量匹配：<image> 标记的数量必须与 images 数组中的图片数量严格一致
位置灵活：<image> 可以出现在文本的任何位置

4.2 图片路径支持

相对路径：相对于 media_dir 的路径
绝对路径：直接使用绝对路径
云存储路径：支持 s3://、gs://、gcs:// 格式

4.3 对话格式要求

角色交替：human 和 gpt 必须交替出现
奇数位置：human 和 observation 必须在奇数位置
偶数位置：gpt 和 function 必须在偶数位置

5. 代码实现细节

从 src/llamafactory/data/converter.py 的 SharegptDatasetConverter 类可以看出：

python
展开代码
def __call__(self, example: dict[str, Any]) -> dict[str, Any]:
    # 处理对话消息
    messages = example[self.dataset_attr.messages]
    
    # 处理图片路径
    "_images": self._find_medias(example[self.dataset_attr.images]) if self.dataset_attr.images else None,

6. 完整的数据集示例

json
展开代码
[
    {
        "conversations": [
            {
                "from": "human",
                "value": "<image>Describe this image in detail."
            },
            {
                "from": "gpt",
                "value": "The image shows a beautiful landscape..."
            }
        ],
        "images": [
            "/path/to/image1.jpg"
        ]
    },
    {
        "conversations": [
            {
                "from": "human", 
                "value": "<image><image>Compare these two images."
            },
            {
                "from": "gpt",
                "value": "Both images show different aspects..."
            }
        ],
        "images": [
            "/path/to/image2.jpg",
            "/path/to/image3.jpg"
        ]
    }
]

7. 训练命令示例

bash
展开代码
python src/train_bash.py \
    --stage sft \
    --model_name_or_path /path/to/model \
    --dataset my_vlm_dataset \
    --dataset_dir /path/to/data \
    --media_dir /path/to/images \
    --template qwen2_vl \
    --output_dir /path/to/output \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3 \
    --max_samples 1000 \
    --cutoff_len 2048 \
    --dataloader_num_workers 8

总结

你的数据格式需要做以下修改：

添加 <image> 标记到对话文本中
将 image 字段改为 images 数组
移除不必要的 width 和 height 字段
确保图片数量与 <image> 标记数量一致

这样修改后，数据就符合 LLaMA-Factory 的 ShareGPT + VLM 训练格式要求了。

目录

详细分析：ShareGPT + VLM + 指令监督微调格式

1. 你的数据格式分析

2. 正确的 ShareGPT + VLM 格式

格式 A：标准 ShareGPT 格式（推荐）

格式 B：OpenAI 格式

3. 对应的 dataset_info.json 配置

配置 A：标准 ShareGPT 格式

配置 B：OpenAI 格式

4. 关键要求说明

4.1 图片标记要求

4.2 图片路径支持

4.3 对话格式要求

5. 代码实现细节

6. 完整的数据集示例

7. 训练命令示例

总结