For more information, see the official documentation:
https://internvl.readthedocs.io/en/latest/internvl2.0/finetune.html
Training script: internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_full.sh
Docker image: kevinchina/deeplearning:trainintervl-of
Meta File
{
  "your-custom-dataset-1": {
    "root": "path/to/the/image/",
    "annotation": "path/to/the/jsonl/annotation",
    "data_augment": false,
    "max_dynamic_patch": 12,
    "repeat_time": 1,
    "length": "number of samples in the dataset"
  },
  ...
}
Single-image sample example:
{
  "id": 0,
  "image": "images/00000000.jpg",
  "width": 897,
  "height": 1152,
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nCan you extract any readable text from the image?"
    },
    {
      "from": "gpt",
      "value": "Dares Wins Vol. 5 Tommy's Heroes Vol. 6: For Tomorrow Vol. 7: Closing Time miniseries. Clark Kent is being interviewed about Superman's connection to notorious killer Tommy Monaghan. Taking the conversation..."
    }
  ]
}
Bounding-box (detection) data format:
<ref>class name</ref><box>[[x1, y1, x2, y2], ...]</box>
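As a minimal sketch of the format above, the helper below builds the answer string for one class name and one or more boxes. The function name format_grounding is hypothetical and not part of InternVL; it only assumes that each box is an integer [x1, y1, x2, y2] list.

```python
def format_grounding(class_name, boxes):
    """Build '<ref>class</ref><box>[[x1, y1, x2, y2], ...]</box>'.

    Hypothetical helper for illustration; not part of InternVL.
    """
    for box in boxes:
        if len(box) != 4:
            raise ValueError(f"expected [x1, y1, x2, y2], got {box!r}")
    # Render every box as "[x1, y1, x2, y2]" and join them with ", ".
    coords = ", ".join(
        "[" + ", ".join(str(int(v)) for v in box) + "]" for box in boxes
    )
    return f"<ref>{class_name}</ref><box>[{coords}]</box>"


print(format_grounding("文本-地址", [[33, 239, 66, 262]]))
```

Several boxes for the same class go into the same outer list, for example format_grounding("icon", [[1, 2, 3, 4], [5, 6, 7, 8]]).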
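My final converted_dataset.jsonl file contains records like the following, one per line. Each question asks, in Chinese, what is located at a given point; the answer gives the element label and its bounding box: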
{"id": 0, "image": "/img_datasets/img_small_size/didichuxing-20240914171548.jpg", "width": 447, "height": 1000, "conversations": [{"from": "human", "value": "<image>点[56,257]所处位置的信息是什么?"}, {"from": "gpt", "value": "<ref>文本-地址</ref><box>[[33, 239, 66, 262]]</box>"}]}
{"id": 1, "image": "/img_datasets/img_small_size/didichuxing-20240914171548.jpg", "width": 447, "height": 1000, "conversations": [{"from": "human", "value": "<image>点[152,254]所处位置的信息是什么?"}, {"from": "gpt", "value": "<ref>按钮-打开请填写地址</ref><box>[[10, 213, 431, 286]]</box>"}]}
{"id": 2, "image": "/img_datasets/img_small_size/didichuxing-20240914171548.jpg", "width": 447, "height": 1000, "conversations": [{"from": "human", "value": "<image>点[364,244]所处位置的信息是什么?"}, {"from": "gpt", "value": "<ref>按钮-地图选址</ref><box>[[348, 241, 405, 263]]</box>"}]}
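Before training on several hundred thousand such records, it may be worth sanity-checking them. The sketch below parses one JSONL line and reports obvious problems. The name check_record and the regular expressions are my own, and the bounds check assumes the coordinates are pixel values in the image, which the samples above are consistent with but the source does not state.

```python
import json
import re

# Patterns for the question ("点[x,y]...") and the single-box answer.
POINT_RE = re.compile(r"点\[(\d+),\s*(\d+)\]")
BOX_RE = re.compile(r"<box>\[\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\]</box>")

SAMPLE = (
    '{"id": 0, "image": "/img_datasets/img_small_size/didichuxing-20240914171548.jpg", '
    '"width": 447, "height": 1000, "conversations": ['
    '{"from": "human", "value": "<image>点[56,257]所处位置的信息是什么?"}, '
    '{"from": "gpt", "value": "<ref>文本-地址</ref><box>[[33, 239, 66, 262]]</box>"}]}'
)


def check_record(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    rec = json.loads(line)
    human, gpt = rec["conversations"][0], rec["conversations"][1]
    problems = []
    if "<image>" not in human["value"]:
        problems.append("missing <image> placeholder")
    point = POINT_RE.search(human["value"])
    box = BOX_RE.search(gpt["value"])
    if point is None or box is None:
        problems.append("could not parse point or box")
        return problems
    px, py = map(int, point.groups())
    x1, y1, x2, y2 = map(int, box.groups())
    if not (x1 <= px <= x2 and y1 <= py <= y2):
        problems.append("point lies outside its box")
    # Assumption: coordinates are pixels, so the box must fit in the image.
    if not (0 <= x1 < x2 <= rec["width"] and 0 <= y1 < y2 <= rec["height"]):
        problems.append("box lies outside the image")
    return problems


print(check_record(SAMPLE))
```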
My final converted_dataset_meta file:
{
  "point_to_box": {
    "root": "/",
    "annotation": "/meta_jsonl/converted_dataset.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 797760
  }
}
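The "length" field has to match the number of samples in the annotation file, so one option is to generate the meta file rather than write it by hand. This is a sketch only: build_meta is a hypothetical helper, and it assumes one JSON record per non-empty line.

```python
import json
import os
import tempfile


def build_meta(name, root, annotation, repeat_time=1):
    """Return a meta-file dict, counting non-empty JSONL lines as 'length'."""
    with open(annotation, encoding="utf-8") as f:
        length = sum(1 for line in f if line.strip())
    return {
        name: {
            "root": root,
            "annotation": annotation,
            "data_augment": False,
            "repeat_time": repeat_time,
            "length": length,
        }
    }


if __name__ == "__main__":
    # Demo on a throwaway three-record file instead of the real dataset.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "converted_dataset.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            f.write('{"id": 0}\n{"id": 1}\n{"id": 2}\n')
        print(json.dumps(build_meta("point_to_box", "/", path), indent=2))
```

With the real annotation path, the resulting dictionary can be written out with json.dump to the file passed as --meta_path.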
Start training:
cd /app/InternVL/internvl_chat && \
sh shell/internvl2.0/3nd_finetune/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_full.sh
internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_full.sh
set -x
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
export MASTER_PORT=34229
export TF_CPP_MIN_LOG_LEVEL=3
export LAUNCHER=pytorch
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
GPUS_PER_NODE=$(/root/miniconda3/envs/opensora/bin/python -c 'import torch; print(torch.cuda.device_count())')
OUTPUT_DIR='/app/InternVL/internvl_chat/work_dirs/internvl_chat_v2_0/'
LOGS_DIR="/tmp/logs/"
mkdir -p "$LOGS_DIR"
if [ ! -d "$OUTPUT_DIR" ]; then
  mkdir -p "$OUTPUT_DIR"
fi
BATCH_SIZE=${BATCH_SIZE:-640}
PER_DEVICE_BATCH_SIZE=${PER_DEVICE_BATCH_SIZE:-8}
GRADIENT_ACC=$((BATCH_SIZE / PER_DEVICE_BATCH_SIZE / GPUS_PER_NODE))
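# NNODES and RANK are used below but never set in this script; the
# single-node defaults here are an assumption for when the environment
# does not provide them.
NNODES=${NNODES:-1}
RANK=${RANK:-0}
MASTER_ADDR=${MASTER_ADDR:-localhost}
# MASTER_PORT is already exported above, so the 6001 fallback is never used.
MASTER_PORT=${MASTER_PORT:-6001}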
DISTRIBUTED_ARGS="
--nproc-per-node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
# epochs: 6 (set by --num_train_epochs below)
/root/miniconda3/envs/opensora/bin/torchrun $DISTRIBUTED_ARGS \
internvl/train/internvl_chat_finetune.py \
--model_name_or_path "/internvl2_8b" \
--conv_style "internlm2-chat" \
--output_dir ${OUTPUT_DIR} \
--meta_path "./shell/data/internvl_1_2_finetune_custom.json" \
--overwrite_output_dir True \
--force_image_size 448 \
--max_dynamic_patch 6 \
--down_sample_ratio 0.5 \
--drop_path_rate 0.1 \
--freeze_llm False \
--freeze_mlp False \
--freeze_backbone False \
--vision_select_layer -1 \
--dataloader_num_workers 8 \
--bf16 True \
--num_train_epochs 6 \
--per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \
--gradient_accumulation_steps ${GRADIENT_ACC} \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 200 \
--save_total_limit 1 \
--learning_rate 4e-5 \
--weight_decay 0.01 \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--max_seq_length 4096 \
--do_train True \
--grad_checkpoint True \
--group_by_length True \
--dynamic_image_size True \
--use_thumbnail True \
--ps_version 'v2' \
--deepspeed "zero_stage1_config.json" \
--report_to "tensorboard" \
2>&1 | tee -a "${LOGS_DIR}/training_log.txt"
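The script derives GRADIENT_ACC by integer division, so the global batch size actually used depends on how many GPUs are visible. The sketch below reproduces that arithmetic for a single node; the function names are mine, and the GPU counts are only examples.

```python
def gradient_accumulation(batch_size, per_device_batch_size, gpus_per_node):
    """Mirror GRADIENT_ACC=$((BATCH_SIZE / PER_DEVICE_BATCH_SIZE / GPUS_PER_NODE)).

    Shell arithmetic truncates; for positive integers that equals Python's //.
    """
    return batch_size // per_device_batch_size // gpus_per_node


def effective_batch_size(batch_size, per_device_batch_size, gpus_per_node):
    """Samples consumed per optimizer step on a single node."""
    steps = gradient_accumulation(batch_size, per_device_batch_size, gpus_per_node)
    return steps * per_device_batch_size * gpus_per_node


# Script defaults: BATCH_SIZE=640, PER_DEVICE_BATCH_SIZE=8.
for gpus in (1, 3, 8):
    acc = gradient_accumulation(640, 8, gpus)
    print(gpus, acc, effective_batch_size(640, 8, gpus))
# 1 80 640
# 3 26 624
# 8 10 640
```

When the GPU count does not divide BATCH_SIZE / PER_DEVICE_BATCH_SIZE evenly, as with 3 GPUs here, the effective batch size falls slightly below the requested 640.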
internvl_1_2_finetune_custom.json
{
  "point_to_box": {
    "root": "/",
    "annotation": "/meta_jsonl/converted_dataset.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 797760
  }
}
Launch the container:
docker run -it \
  -v /ssd/xiedong/qwenvl_train_ui_ground_datasets:/ssd/xiedong/qwenvl_train_ui_ground_datasets \
  -v /ssd/xiedong/internvl8byangfan:/model \
  --net host \
  --gpus '"device=1"' \
  kevinchina/deeplearning:trainintervl-of bash
Running model inference...
Current progress: processed 400/2000 valid samples
Accuracy at IoU > 0.3: 1.0000
Point-in-box accuracy: 1.0000
The newly trained model looks good: both metrics are 1.0000 over the 400 samples processed so far.
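The two metrics in the log can be computed along the following lines. The evaluation script itself is not shown in this post, so this is a sketch under assumptions: boxes are [x1, y1, x2, y2], a prediction counts as correct when its IoU with the ground-truth box exceeds 0.3, and "point in box" means the query point falls inside the predicted box. All function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def point_in_box(point, box):
    """True if the point lies inside the box (edges included)."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2


def evaluate(samples, iou_threshold=0.3):
    """samples: iterable of (point, predicted_box, ground_truth_box).

    Returns (IoU accuracy, point-in-box accuracy).
    """
    samples = list(samples)
    if not samples:
        return 0.0, 0.0
    iou_hits = sum(iou(pred, gt) > iou_threshold for _, pred, gt in samples)
    point_hits = sum(point_in_box(pt, pred) for pt, pred, _ in samples)
    return iou_hits / len(samples), point_hits / len(samples)


print(evaluate([([56, 257], [33, 239, 66, 262], [33, 239, 66, 262])]))
```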
Accessing the model through the API:
python x18_request_ui_point_query.py "点[56,257]所处位置的信息是什么?" /ssd/xiedong/qwenvl_train_ui_ground_datasets/img_small_size/didichuxing-20240914171548.jpg
Sending request with image file: /ssd/xiedong/qwenvl_train_ui_ground_datasets/img_small_size/didichuxing-20240914171548.jpg
Request successful. Status code: 200
API Response:
{
  "box": [33, 239, 66, 262],
  "code": 0,
  "element_description": "文本-地址",
  "imgurl": null,
  "point": [56, 257],
  "response": "文本-地址[[33, 239, 66, 262]]",
  "text": "succeed"
}
Training sample:
{"id": 0, "image": "/img_datasets/img_small_size/didichuxing-20240914171548.jpg", "width": 447, "height": 1000, "conversations": [{"from": "human", "value": "<image>点[56,257]所处位置的信息是什么?"}, {"from": "gpt", "value": "<ref>文本-地址</ref><box>[[33, 239, 66, 262]]</box>"}]}
Expected model output: "<ref>文本-地址</ref><box>[[33, 239, 66, 262]]</box>"
<ref> and <box> are special tags, so they are stripped from the text the model returns.
The model's actual output is: "response": "文本-地址[[33, 239, 66, 262]]"
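Because the tags are gone, a client has to split the label from the coordinates itself. The sketch below shows one way to recover the two fields from the "response" string. It is not the code behind the API above: parse_response and its regular expression are assumptions, and it handles a single box only. It also accepts the tagged form, in case the tags are not stripped.

```python
import re

# Label, then one [[x1, y1, x2, y2]] group at the end of the string.
RESPONSE_RE = re.compile(r"^(?P<label>.*?)\[\[(?P<coords>[\d,\s]+)\]\]$")


def parse_response(text):
    """Split 'label[[x1, y1, x2, y2]]' into (label, [x1, y1, x2, y2])."""
    # Drop <ref>/<box> tags if they are still present.
    cleaned = re.sub(r"</?(?:ref|box)>", "", text).strip()
    match = RESPONSE_RE.match(cleaned)
    if match is None:
        raise ValueError(f"unexpected response format: {text!r}")
    box = [int(v) for v in match.group("coords").split(",")]
    if len(box) != 4:
        raise ValueError(f"expected four coordinates, got {box!r}")
    return match.group("label"), box


print(parse_response("文本-地址[[33, 239, 66, 262]]"))
```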
Author: Dong
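Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC (Creative Commons Attribution-NonCommercial 4.0 International). You may freely repost and adapt them for non-commercial purposes, provided you credit the source and link to the original author.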