LLaMA Factory vlm 训练用的标准ShareGPT数据的Tags标记规则

Read file: src/llamafactory/data/converter.py Read file: src/llamafactory/data/parser.py

LLaMA Factory 不是依靠位置读取的，而是依靠字段名映射！

关键发现：

1. 字段名映射机制

从 src/llamafactory/data/parser.py 第 47-53 行可以看出，LLaMA Factory 使用字段名映射而不是位置：

python
展开代码
# sharegpt tags
role_tag: Optional[str] = "from"        # 默认映射 "from" 字段
content_tag: Optional[str] = "value"    # 默认映射 "value" 字段
user_tag: Optional[str] = "human"       # 默认映射 "human" 值
assistant_tag: Optional[str] = "gpt"    # 默认映射 "gpt" 值

2. 可配置的字段名

从 src/llamafactory/data/parser.py 第 85-90 行可以看出，这些字段名是可配置的：

python
展开代码
if "tags" in attr:
    tag_names = ["role_tag", "content_tag"]
    tag_names += ["user_tag", "assistant_tag", "observation_tag", "function_tag", "system_tag"]
    for tag in tag_names:
        self.set_attr(tag, attr["tags"])

3. 处理逻辑

在 src/llamafactory/data/converter.py 的 SharegptDatasetConverter 中：

python
展开代码
def __call__(self, example: dict[str, Any]) -> dict[str, Any]:
    tag_mapping = {
        self.dataset_attr.user_tag: Role.USER.value,        # "human" -> "user"
        self.dataset_attr.assistant_tag: Role.ASSISTANT.value, # "gpt" -> "assistant"
        # ...
    }
    
    for turn_idx, message in enumerate(messages):
        # 使用配置的字段名读取
        if message[self.dataset_attr.role_tag] not in accept_tags[turn_idx % 2]:
            # ...
        
        aligned_messages.append({
            "role": tag_mapping[message[self.dataset_attr.role_tag]],  # 映射角色
            "content": message[self.dataset_attr.content_tag],         # 映射内容
        })

回答问题：

"from": "human", "from": "gpt" 这两行随便叫啥无所谓吗？

不是的！ 字段名和值都有特定要求：

字段名：默认是 "from"，但可以通过 tags 配置修改
字段值：默认是 "human" 和 "gpt"，但也可以通过 tags 配置修改

示例配置

标准 ShareGPT 格式（推荐）

bash
展开代码
{
    "conversations": [
        {
            "from": "human",
            "value": "<image>Write a detailed description of this image, do not forget about the texts on it if they exist. Also, do not forget to mention the type / style of the image. No bullet points. When writing descriptions, prioritize clarity and direct observation over embellishment or interpretation.\nDon't forget these rules:\n1. **Be Direct and Concise**: Provide straightforward descriptions without adding interpretative or speculative elements.\n2. **Use Segmented Details**: Break down details about different elements of an image into distinct sentences, focusing on one aspect at a time.\n3. **Maintain a Descriptive Focus**: Prioritize purely visible elements of the image, avoiding conclusions or inferences.\n4. **Follow a Logical Structure**: Begin with the central figure or subject and expand outward, detailing its appearance before addressing the surrounding setting.\n5. **Avoid Juxtaposition**: Do not use comparison or contrast language; keep the description purely factual.\n6. **Incorporate Specificity**: Mention age, gender, race, and specific brands or notable features when present, and clearly identify the medium if it's discernible."
        },
        {
            "from": "gpt",
            "value": "The image is of a pair of silver earrings. Each earring features a rectangular silver frame with an oval purple gemstone in the center. The frame has an ornate design with a sunburst pattern at the top and a spike-like protrusion at the bottom. The earrings are made of silver and have a hook-style backing."
        }
    ],
    "images": [
        "000443568.jpg"
    ]
}

标准 ShareGPT 格式（推荐）的 dataset_info.json 可以这样配置：

json
展开代码
{
    "my_vlm_dataset": {
        "file_name": "data.json",
        "formatting": "sharegpt",
        "columns": {
            "messages": "conversations",
            "images": "images"
        },
        "tags": {
            "role_tag": "from",           // 角色字段名
            "content_tag": "value",       // 内容字段名
            "user_tag": "human",          // 用户角色值
            "assistant_tag": "gpt"        // 助手角色值
        }
    }
}

如果你想使用不同的字段名和值

比如你想用 "role" 和 "content" 字段，值用 "user" 和 "assistant"：

json
展开代码
{
    "my_vlm_dataset": {
        "file_name": "data.json",
        "formatting": "sharegpt",
        "columns": {
            "messages": "conversations",
            "images": "images"
        },
        "tags": {
            "role_tag": "role",
            "content_tag": "content", 
            "user_tag": "user",
            "assistant_tag": "assistant"
        }
    }
}

然后你的数据格式就变成：

json
展开代码
{
    "conversations": [
        {
            "role": "user",
            "content": "<image>请描述这张图片..."
        },
        {
            "role": "assistant", 
            "content": "这是一对银耳环..."
        }
    ],
    "images": ["000443568.jpg"]
}

总结

LLaMA Factory 不是依靠位置读取，而是依靠字段名映射。字段名和值都可以通过 tags 配置自定义，但必须保持一致性。

目录