Tag rules for standard ShareGPT data used in LLaMA Factory VLM training
2025-07-19
Deep Learning

The following is based on reading src/llamafactory/data/converter.py and src/llamafactory/data/parser.py.

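LLaMA Factory reads messages by field-name mapping, not by position!

Key findings:

1. Field-name mapping mechanism

Lines 47-53 of src/llamafactory/data/parser.py (in the version examined) show that LLaMA Factory uses field-name mapping rather than position: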

```python
# sharegpt tags
role_tag: Optional[str] = "from"  # maps the "from" field by default
content_tag: Optional[str] = "value"  # maps the "value" field by default
user_tag: Optional[str] = "human"  # maps the "human" value by default
assistant_tag: Optional[str] = "gpt"  # maps the "gpt" value by default
```
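To make these defaults concrete, here is a minimal sketch of reading one ShareGPT message through tag names instead of hard-coded keys. `ShareGPTTags` and `read_message` are hypothetical names used only for illustration, not LLaMA Factory's API; only the four default values come from the excerpt above.

```python
from dataclasses import dataclass


@dataclass
class ShareGPTTags:
    """Hypothetical stand-in holding the four defaults quoted above."""

    role_tag: str = "from"
    content_tag: str = "value"
    user_tag: str = "human"
    assistant_tag: str = "gpt"


def read_message(message: dict, tags: ShareGPTTags) -> tuple[str, str]:
    """Look up role and content by the configured key names."""
    return message[tags.role_tag], message[tags.content_tag]


tags = ShareGPTTags()
# The order of keys inside the dict is irrelevant; only the key names matter.
role, content = read_message({"value": "Hello", "from": "human"}, tags)
print(role, content)
```

Swapping the two keys in the input dict gives the same result, which is the point of name-based lookup.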

2. Configurable field names

Lines 85-90 of src/llamafactory/data/parser.py (in the version examined) show that these field names are configurable:

```python
if "tags" in attr:
    tag_names = ["role_tag", "content_tag"]
    tag_names += ["user_tag", "assistant_tag", "observation_tag", "function_tag", "system_tag"]
    for tag in tag_names:
        self.set_attr(tag, attr["tags"])
```
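The loop above can be mimicked with a small sketch. `DEFAULT_TAGS` and `apply_tags` are hypothetical names. This sketch keeps the default for any tag missing from `tags`, which is an assumption: the excerpt does not show how `set_attr` treats missing keys.

```python
DEFAULT_TAGS = {
    "role_tag": "from",
    "content_tag": "value",
    "user_tag": "human",
    "assistant_tag": "gpt",
}


def apply_tags(attr: dict) -> dict:
    """Return the effective tags for one dataset_info.json entry."""
    effective = dict(DEFAULT_TAGS)
    overrides = attr.get("tags", {})
    for name in effective:
        # Assumption: a tag that is not listed keeps its default value.
        if name in overrides:
            effective[name] = overrides[name]
    return effective


entry = {
    "formatting": "sharegpt",
    "tags": {"role_tag": "role", "content_tag": "content"},
}
effective_tags = apply_tags(entry)
print(effective_tags)
```

Because the handling of omitted tags is not visible in the excerpt, listing every tag you rely on in dataset_info.json is the cautious choice.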

3. Processing logic

In SharegptDatasetConverter in src/llamafactory/data/converter.py:

```python
def __call__(self, example: dict[str, Any]) -> dict[str, Any]:
    tag_mapping = {
        self.dataset_attr.user_tag: Role.USER.value,  # "human" -> "user"
        self.dataset_attr.assistant_tag: Role.ASSISTANT.value,  # "gpt" -> "assistant"
        # ...
    }
    for turn_idx, message in enumerate(messages):
        # read the role through the configured field name
        if message[self.dataset_attr.role_tag] not in accept_tags[turn_idx % 2]:
            ...  # body elided in this excerpt
        aligned_messages.append({
            "role": tag_mapping[message[self.dataset_attr.role_tag]],  # map the role
            "content": message[self.dataset_attr.content_tag],  # map the content
        })
```
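A self-contained sketch of the same idea follows. `convert_sharegpt` is a hypothetical function, not the library's converter. It assumes user turns sit at even indices and assistant turns at odd indices, and it raises on a mismatch, since the excerpt does not show what the real converter does in that case.

```python
def convert_sharegpt(messages: list[dict], tags: dict) -> list[dict]:
    """Map ShareGPT turns to {"role", "content"} dicts using the configured tags."""
    tag_mapping = {
        tags["user_tag"]: "user",
        tags["assistant_tag"]: "assistant",
    }
    # Assumption: even turns are user turns, odd turns are assistant turns.
    accept_tags = (tags["user_tag"], tags["assistant_tag"])
    aligned = []
    for turn_idx, message in enumerate(messages):
        role = message[tags["role_tag"]]
        if role != accept_tags[turn_idx % 2]:
            raise ValueError(f"unexpected role {role!r} at turn {turn_idx}")
        aligned.append(
            {"role": tag_mapping[role], "content": message[tags["content_tag"]]}
        )
    return aligned


default_tags = {
    "role_tag": "from",
    "content_tag": "value",
    "user_tag": "human",
    "assistant_tag": "gpt",
}
conversation = [
    {"from": "human", "value": "<image>Describe this image."},
    {"from": "gpt", "value": "A pair of silver earrings."},
]
aligned_messages = convert_sharegpt(conversation, default_tags)
print(aligned_messages)
```

Passing a different `tags` dict is all it takes to read data that uses other key names or role values.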

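Answering the question:

"from": "human", "from": "gpt": can these two entries be named anything you like?

No! Both the field name and the role values have to match what LLaMA Factory is configured to expect:

  1. Field name: the default is "from", but it can be changed through the tags configuration
  2. Field values: the defaults are "human" and "gpt", but they can also be changed through the tags configuration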

Example configuration

Standard ShareGPT format (recommended)

```json
{
  "conversations": [
    {
      "from": "human",
      "value": "<image>Write a detailed description of this image, do not forget about the texts on it if they exist. Also, do not forget to mention the type / style of the image. No bullet points. When writing descriptions, prioritize clarity and direct observation over embellishment or interpretation.\nDon't forget these rules:\n1. **Be Direct and Concise**: Provide straightforward descriptions without adding interpretative or speculative elements.\n2. **Use Segmented Details**: Break down details about different elements of an image into distinct sentences, focusing on one aspect at a time.\n3. **Maintain a Descriptive Focus**: Prioritize purely visible elements of the image, avoiding conclusions or inferences.\n4. **Follow a Logical Structure**: Begin with the central figure or subject and expand outward, detailing its appearance before addressing the surrounding setting.\n5. **Avoid Juxtaposition**: Do not use comparison or contrast language; keep the description purely factual.\n6. **Incorporate Specificity**: Mention age, gender, race, and specific brands or notable features when present, and clearly identify the medium if it's discernible."
    },
    {
      "from": "gpt",
      "value": "The image is of a pair of silver earrings. Each earring features a rectangular silver frame with an oval purple gemstone in the center. The frame has an ornate design with a sunburst pattern at the top and a spike-like protrusion at the bottom. The earrings are made of silver and have a hook-style backing."
    }
  ],
  "images": [
    "000443568.jpg"
  ]
}
```
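In this sample, one `<image>` placeholder corresponds to one entry in `images`. Assuming the two counts are meant to match (an assumption drawn from this sample, not from the code excerpts in this post), a quick check before training might look like the sketch below. `count_image_placeholders` is a hypothetical helper.

```python
IMAGE_PLACEHOLDER = "<image>"


def count_image_placeholders(
    record: dict,
    content_key: str = "value",
    messages_key: str = "conversations",
) -> int:
    """Count <image> placeholders across all turns of one record."""
    return sum(
        message[content_key].count(IMAGE_PLACEHOLDER)
        for message in record[messages_key]
    )


sample = {
    "conversations": [
        {"from": "human", "value": "<image>Write a detailed description of this image."},
        {"from": "gpt", "value": "The image is of a pair of silver earrings."},
    ],
    "images": ["000443568.jpg"],
}
placeholders = count_image_placeholders(sample)
counts_match = placeholders == len(sample["images"])
print(placeholders, counts_match)
```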

For the standard ShareGPT format (recommended), dataset_info.json can be configured like this:

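```json
{
  "my_vlm_dataset": {
    "file_name": "data.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations",
      "images": "images"
    },
    "tags": {
      "role_tag": "from",
      "content_tag": "value",
      "user_tag": "human",
      "assistant_tag": "gpt"
    }
  }
}
```

The four tags mean the following (standard JSON does not allow inline comments, so keep them out of the file):

  1. role_tag: name of the field that holds the role
  2. content_tag: name of the field that holds the content
  3. user_tag: role value for the user
  4. assistant_tag: role value for the assistant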
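Since the tags have to agree with the data file, a small check can catch mismatches before training. This is a sketch and not part of LLaMA Factory: `find_tag_mismatches` is a hypothetical helper, and for brevity it only checks the user and assistant tags.

```python
def find_tag_mismatches(
    record: dict, tags: dict, messages_key: str = "conversations"
) -> list[str]:
    """Return human-readable problems where a record disagrees with the tags."""
    problems = []
    allowed_roles = {tags["user_tag"], tags["assistant_tag"]}
    for idx, message in enumerate(record[messages_key]):
        # Both configured keys must be present in every turn.
        for key in (tags["role_tag"], tags["content_tag"]):
            if key not in message:
                problems.append(f"turn {idx}: missing key {key!r}")
        role = message.get(tags["role_tag"])
        if role is not None and role not in allowed_roles:
            problems.append(f"turn {idx}: unexpected role value {role!r}")
    return problems


configured_tags = {
    "role_tag": "from",
    "content_tag": "value",
    "user_tag": "human",
    "assistant_tag": "gpt",
}
good_record = {
    "conversations": [
        {"from": "human", "value": "<image>Describe this image."},
        {"from": "gpt", "value": "A pair of silver earrings."},
    ]
}
# Same data, but written with keys and values the tags do not describe.
bad_record = {
    "conversations": [
        {"role": "user", "content": "<image>Describe this image."},
        {"role": "assistant", "content": "A pair of silver earrings."},
    ]
}
print(find_tag_mismatches(good_record, configured_tags))
print(find_tag_mismatches(bad_record, configured_tags))
```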

If you want to use different field names and values

For example, to use "role" and "content" as the field names, with "user" and "assistant" as the values:

```json
{
  "my_vlm_dataset": {
    "file_name": "data.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations",
      "images": "images"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant"
    }
  }
}
```

Your data format then becomes:

```json
{
  "conversations": [
    {
      "role": "user",
      "content": "<image>Please describe this image..."
    },
    {
      "role": "assistant",
      "content": "This is a pair of silver earrings..."
    }
  ],
  "images": ["000443568.jpg"]
}
```
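If existing data is already in the default "from"/"value" layout, it can be rewritten to the custom layout with a short script. The sketch below uses a hypothetical helper, `retag_record`, and handles only user and assistant turns.

```python
def retag_record(
    record: dict, old: dict, new: dict, messages_key: str = "conversations"
) -> dict:
    """Return a copy of the record with keys and role values renamed."""
    value_map = {
        old["user_tag"]: new["user_tag"],
        old["assistant_tag"]: new["assistant_tag"],
    }
    converted = dict(record)  # shallow copy; other columns such as "images" are kept
    converted[messages_key] = [
        {
            new["role_tag"]: value_map[message[old["role_tag"]]],
            new["content_tag"]: message[old["content_tag"]],
        }
        for message in record[messages_key]
    ]
    return converted


old_tags = {"role_tag": "from", "content_tag": "value", "user_tag": "human", "assistant_tag": "gpt"}
new_tags = {"role_tag": "role", "content_tag": "content", "user_tag": "user", "assistant_tag": "assistant"}
record = {
    "conversations": [
        {"from": "human", "value": "<image>Describe this image."},
        {"from": "gpt", "value": "A pair of silver earrings."},
    ],
    "images": ["000443568.jpg"],
}
converted = retag_record(record, old_tags, new_tags)
print(converted)
```

Either layout works as long as the `tags` block in dataset_info.json describes the layout the file actually uses.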

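Summary

LLaMA Factory reads each message by field-name mapping, not by the position of its keys. Both the field names and the role values can be customized through tags, but the tags in dataset_info.json must match the keys and values actually used in the data file. Turn order still matters: the converter excerpt checks each turn's role against accept_tags[turn_idx % 2].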


Author: Dong


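Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC (Creative Commons Attribution-NonCommercial 4.0 International). You may freely repost and modify them for non-commercial purposes, provided you credit the source and link to the original author.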