Read file: src/llamafactory/data/converter.py Read file: src/llamafactory/data/parser.py
从 src/llamafactory/data/parser.py
第 47-53 行可以看出,LLaMA Factory 使用字段名映射而不是位置:
python展开代码# sharegpt tags
role_tag: Optional[str] = "from" # 默认映射 "from" 字段
content_tag: Optional[str] = "value" # 默认映射 "value" 字段
user_tag: Optional[str] = "human" # 默认映射 "human" 值
assistant_tag: Optional[str] = "gpt" # 默认映射 "gpt" 值
从 src/llamafactory/data/parser.py
第 85-90 行可以看出,这些字段名是可配置的:
python展开代码if "tags" in attr:
tag_names = ["role_tag", "content_tag"]
tag_names += ["user_tag", "assistant_tag", "observation_tag", "function_tag", "system_tag"]
for tag in tag_names:
self.set_attr(tag, attr["tags"])
在 src/llamafactory/data/converter.py
的 SharegptDatasetConverter
中:
python展开代码def __call__(self, example: dict[str, Any]) -> dict[str, Any]:
tag_mapping = {
self.dataset_attr.user_tag: Role.USER.value, # "human" -> "user"
self.dataset_attr.assistant_tag: Role.ASSISTANT.value, # "gpt" -> "assistant"
# ...
}
for turn_idx, message in enumerate(messages):
# 使用配置的字段名读取
if message[self.dataset_attr.role_tag] not in accept_tags[turn_idx % 2]:
# ...
aligned_messages.append({
"role": tag_mapping[message[self.dataset_attr.role_tag]], # 映射角色
"content": message[self.dataset_attr.content_tag], # 映射内容
})
不是的! 字段名和值都有特定要求:
"from"
,但可以通过 tags
配置修改"human"
和 "gpt"
,但也可以通过 tags
配置修改标准 ShareGPT 格式(推荐)
bash展开代码{
"conversations": [
{
"from": "human",
"value": "<image>Write a detailed description of this image, do not forget about the texts on it if they exist. Also, do not forget to mention the type / style of the image. No bullet points. When writing descriptions, prioritize clarity and direct observation over embellishment or interpretation.\nDon't forget these rules:\n1. **Be Direct and Concise**: Provide straightforward descriptions without adding interpretative or speculative elements.\n2. **Use Segmented Details**: Break down details about different elements of an image into distinct sentences, focusing on one aspect at a time.\n3. **Maintain a Descriptive Focus**: Prioritize purely visible elements of the image, avoiding conclusions or inferences.\n4. **Follow a Logical Structure**: Begin with the central figure or subject and expand outward, detailing its appearance before addressing the surrounding setting.\n5. **Avoid Juxtaposition**: Do not use comparison or contrast language; keep the description purely factual.\n6. **Incorporate Specificity**: Mention age, gender, race, and specific brands or notable features when present, and clearly identify the medium if it's discernible."
},
{
"from": "gpt",
"value": "The image is of a pair of silver earrings. Each earring features a rectangular silver frame with an oval purple gemstone in the center. The frame has an ornate design with a sunburst pattern at the top and a spike-like protrusion at the bottom. The earrings are made of silver and have a hook-style backing."
}
],
"images": [
"000443568.jpg"
]
}
标准 ShareGPT 格式(推荐)的 dataset_info.json
可以这样配置:
json展开代码{
"my_vlm_dataset": {
"file_name": "data.json",
"formatting": "sharegpt",
"columns": {
"messages": "conversations",
"images": "images"
},
"tags": {
"role_tag": "from", // 角色字段名
"content_tag": "value", // 内容字段名
"user_tag": "human", // 用户角色值
"assistant_tag": "gpt" // 助手角色值
}
}
}
比如你想用 "role"
和 "content"
字段,值用 "user"
和 "assistant"
:
json展开代码{
"my_vlm_dataset": {
"file_name": "data.json",
"formatting": "sharegpt",
"columns": {
"messages": "conversations",
"images": "images"
},
"tags": {
"role_tag": "role",
"content_tag": "content",
"user_tag": "user",
"assistant_tag": "assistant"
}
}
}
然后你的数据格式就变成:
json展开代码{
"conversations": [
{
"role": "user",
"content": "<image>请描述这张图片..."
},
{
"role": "assistant",
"content": "这是一对银耳环..."
}
],
"images": ["000443568.jpg"]
}
LLaMA Factory 不是依靠位置读取,而是依靠字段名映射。字段名和值都可以通过 tags
配置自定义,但必须保持一致性。
本文作者:Dong
本文链接:
版权声明:本博客所有文章除特别声明外,均采用 CC BY-NC。本作品采用《知识共享署名-非商业性使用 4.0 国际许可协议》进行许可。您可以在非商业用途下自由转载和修改,但必须注明出处并提供原作者链接。 许可协议。转载请注明出处!