2025-04-19
Deep Learning

Contents

1. Docker Image
2. Data
Final Result
3. Start Training
4. Inference

For more details, see the official documentation:

https://internvl.readthedocs.io/en/latest/internvl2.0/finetune.html

1. Docker Image

Training script: internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_full.sh

Docker image: kevinchina/deeplearning:trainintervl-of

2. Data

Meta File

{
  "your-custom-dataset-1": {
    "root": "path/to/the/image/",
    "annotation": "path/to/the/jsonl/annotation",
    "data_augment": false,
    "max_dynamic_patch": 12,
    "repeat_time": 1,
    "length": "number of samples in the dataset"
  },
  ...
}

Single-image sample example:

{
  "id": 0,
  "image": "images/00000000.jpg",
  "width": 897,
  "height": 1152,
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nCan you extract any readable text from the image?"
    },
    {
      "from": "gpt",
      "value": "Dares Wins Vol. 5 Tommy's Heroes Vol. 6: For Tomorrow Vol. 7: Closing Time miniseries. Clark Kent is being interviewed about Superman's connection to notorious killer Tommy Monaghan. Taking the conversation..."
    }
  ]
}
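Before training, it can help to sanity-check each line of the annotation file against the structure shown above. The following is a minimal sketch, not part of InternVL: `check_sample` is a hypothetical helper, and the rule that a single-image sample carries exactly one `<image>` placeholder is an assumption based on the example.

```python
import json

# Keys that appear in the single-image example above.
REQUIRED_KEYS = {"id", "image", "width", "height", "conversations"}


def check_sample(line):
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    sample = json.loads(line)
    problems = []

    missing = REQUIRED_KEYS - sample.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    turns = sample.get("conversations", [])
    # Turns are expected to alternate human, gpt, human, gpt, ...
    for i, turn in enumerate(turns):
        expected = "human" if i % 2 == 0 else "gpt"
        if turn.get("from") != expected:
            problems.append(f"turn {i}: expected from={expected!r}")

    # Assumption: a single-image sample carries exactly one <image> placeholder.
    placeholders = sum(t.get("value", "").count("<image>") for t in turns)
    if placeholders != 1:
        problems.append(f"expected 1 <image> placeholder, found {placeholders}")

    return problems
```

Running it over every line of the JSONL file and printing the non-empty results gives a quick list of records worth inspecting by hand.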

Bounding-box (detection) data:

<ref>class name</ref><box>[[x1, y1, x2, y2], ...]</box>
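To make the format concrete, here is a small sketch that builds and parses such a string. `format_grounding` and `parse_grounding` are hypothetical helper names for illustration; they are not InternVL functions.

```python
import json
import re


def format_grounding(label, boxes):
    """Build '<ref>label</ref><box>[[x1, y1, x2, y2], ...]</box>'."""
    return f"<ref>{label}</ref><box>{json.dumps(boxes)}</box>"


# One <ref>...</ref> immediately followed by one <box>[[...]]</box>.
GROUNDING_RE = re.compile(r"<ref>(.*?)</ref><box>(\[\[.*?\]\])</box>")


def parse_grounding(text):
    """Return a list of (label, boxes) pairs found in text."""
    return [(label, json.loads(boxes)) for label, boxes in GROUNDING_RE.findall(text)]
```

For example, `format_grounding("class name", [[10, 20, 30, 40]])` produces a string in the layout above, and `parse_grounding` turns it back into a label and a list of boxes.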

Final Result

My final converted_dataset.jsonl file contains one JSON object per line:

{"id": 0, "image": "/img_datasets/img_small_size/didichuxing-20240914171548.jpg", "width": 447, "height": 1000, "conversations": [{"from": "human", "value": "<image>点[56,257]所处位置的信息是什么?"}, {"from": "gpt", "value": "<ref>文本-地址</ref><box>[[33, 239, 66, 262]]</box>"}]}
{"id": 1, "image": "/img_datasets/img_small_size/didichuxing-20240914171548.jpg", "width": 447, "height": 1000, "conversations": [{"from": "human", "value": "<image>点[152,254]所处位置的信息是什么?"}, {"from": "gpt", "value": "<ref>按钮-打开请填写地址</ref><box>[[10, 213, 431, 286]]</box>"}]}
{"id": 2, "image": "/img_datasets/img_small_size/didichuxing-20240914171548.jpg", "width": 447, "height": 1000, "conversations": [{"from": "human", "value": "<image>点[364,244]所处位置的信息是什么?"}, {"from": "gpt", "value": "<ref>按钮-地图选址</ref><box>[[348, 241, 405, 263]]</box>"}]}
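Records like these can be generated from whatever annotation source holds the point, label, and box. The sketch below shows one way to serialize a single record; `make_record` is a hypothetical helper, and the question template is copied from the samples above rather than from any InternVL requirement.

```python
import json


def make_record(sample_id, image, width, height, point, label, box):
    """Serialize one point-to-box training sample as a JSONL line."""
    x, y = point
    # Question template copied from the samples in this article.
    question = f"<image>点[{x},{y}]所处位置的信息是什么?"
    answer = f"<ref>{label}</ref><box>{json.dumps([list(box)])}</box>"
    record = {
        "id": sample_id,
        "image": image,
        "width": width,
        "height": height,
        "conversations": [
            {"from": "human", "value": question},
            {"from": "gpt", "value": answer},
        ],
    }
    # ensure_ascii=False keeps the Chinese text readable in the file.
    return json.dumps(record, ensure_ascii=False)
```

Writing one `make_record(...)` result per line, in UTF-8, yields a file in the same layout as converted_dataset.jsonl.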

My final converted_dataset_meta file:

{
  "point_to_box": {
    "root": "/",
    "annotation": "/meta_jsonl/converted_dataset.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 797760
  }
}
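The `length` field should match the number of samples in the annotation file, so it is convenient to compute it rather than type it by hand. A minimal sketch, assuming one JSON object per non-empty line; `build_meta` is a hypothetical helper name.

```python
import json


def build_meta(name, annotation_path, root="/"):
    """Build a meta-file dict whose length is the number of non-empty JSONL lines."""
    with open(annotation_path, encoding="utf-8") as f:
        length = sum(1 for line in f if line.strip())
    return {
        name: {
            "root": root,
            "annotation": annotation_path,
            "data_augment": False,
            "repeat_time": 1,
            "length": length,
        }
    }


def write_meta(meta, out_path):
    """Write the meta dict as indented JSON."""
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(meta, f, indent=2, ensure_ascii=False)
```

For instance, `write_meta(build_meta("point_to_box", "/meta_jsonl/converted_dataset.jsonl"), out_path)` would regenerate a file like the one above, with `out_path` being wherever the training script's `--meta_path` points.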

3. Start Training

Start training:

cd /app/InternVL/internvl_chat && \
sh shell/internvl2.0/3nd_finetune/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_full.sh

internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_full.sh

bash
set -x

export PYTHONPATH="${PYTHONPATH}:$(pwd)"
export MASTER_PORT=34229
export TF_CPP_MIN_LOG_LEVEL=3
export LAUNCHER=pytorch
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Number of visible GPUs on this node
GPUS_PER_NODE=$(/root/miniconda3/envs/opensora/bin/python -c 'import torch; print(torch.cuda.device_count())')

OUTPUT_DIR='/app/InternVL/internvl_chat/work_dirs/internvl_chat_v2_0/'
LOGS_DIR="/tmp/logs/"
mkdir -p $LOGS_DIR

if [ ! -d "$OUTPUT_DIR" ]; then
  mkdir -p "$OUTPUT_DIR"
fi

BATCH_SIZE=${BATCH_SIZE:-640}
PER_DEVICE_BATCH_SIZE=${PER_DEVICE_BATCH_SIZE:-8}
GRADIENT_ACC=$((BATCH_SIZE / PER_DEVICE_BATCH_SIZE / GPUS_PER_NODE))

# Single-node defaults so the script also runs when the launcher
# does not set these; override NNODES and RANK for multi-node runs.
NNODES=${NNODES:-1}
RANK=${RANK:-0}
MASTER_ADDR=${MASTER_ADDR:-localhost}
MASTER_PORT=${MASTER_PORT:-6001}

DISTRIBUTED_ARGS="
  --nproc-per-node $GPUS_PER_NODE \
  --nnodes $NNODES \
  --node_rank $RANK \
  --master_addr $MASTER_ADDR \
  --master_port $MASTER_PORT
"

# epoch: 1
/root/miniconda3/envs/opensora/bin/torchrun $DISTRIBUTED_ARGS \
  internvl/train/internvl_chat_finetune.py \
  --model_name_or_path "/internvl2_8b" \
  --conv_style "internlm2-chat" \
  --output_dir ${OUTPUT_DIR} \
  --meta_path "./shell/data/internvl_1_2_finetune_custom.json" \
  --overwrite_output_dir True \
  --force_image_size 448 \
  --max_dynamic_patch 6 \
  --down_sample_ratio 0.5 \
  --drop_path_rate 0.1 \
  --freeze_llm False \
  --freeze_mlp False \
  --freeze_backbone False \
  --vision_select_layer -1 \
  --dataloader_num_workers 8 \
  --bf16 True \
  --num_train_epochs 6 \
  --per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \
  --gradient_accumulation_steps ${GRADIENT_ACC} \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 200 \
  --save_total_limit 1 \
  --learning_rate 4e-5 \
  --weight_decay 0.01 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --max_seq_length 4096 \
  --do_train True \
  --grad_checkpoint True \
  --group_by_length True \
  --dynamic_image_size True \
  --use_thumbnail True \
  --ps_version 'v2' \
  --deepspeed "zero_stage1_config.json" \
  --report_to "tensorboard" \
  2>&1 | tee -a "${LOGS_DIR}/training_log.txt"
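The script derives the gradient accumulation steps from the target batch size with integer division, so the effective batch size can end up below the target when the numbers do not divide evenly. The sketch below mirrors that arithmetic in Python; the GPU counts used in the example calls are illustrative assumptions, not values from the article.

```python
def gradient_accumulation_steps(batch_size, per_device_batch_size, gpus):
    """Mirror the shell arithmetic: BATCH_SIZE / PER_DEVICE_BATCH_SIZE / GPUS_PER_NODE."""
    return batch_size // per_device_batch_size // gpus


def effective_batch_size(per_device_batch_size, gpus, grad_acc):
    """Samples consumed per optimizer step on a single node."""
    return per_device_batch_size * gpus * grad_acc


# Script defaults: BATCH_SIZE=640, PER_DEVICE_BATCH_SIZE=8.
for gpus in (8, 6):  # hypothetical GPU counts
    acc = gradient_accumulation_steps(640, 8, gpus)
    print(gpus, acc, effective_batch_size(8, gpus, acc))
```

With 8 GPUs the division is exact (10 accumulation steps, 640 samples per step). With 6 GPUs it truncates to 13 steps, giving 624 samples per step instead of 640.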

internvl_1_2_finetune_custom.json

json
{
  "point_to_box": {
    "root": "/",
    "annotation": "/meta_jsonl/converted_dataset.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 797760
  }
}

4. Inference

docker run -it \
  -v /ssd/xiedong/qwenvl_train_ui_ground_datasets:/ssd/xiedong/qwenvl_train_ui_ground_datasets \
  -v /ssd/xiedong/internvl8byangfan:/model \
  --net host \
  --gpus '"device=1"' \
  kevinchina/deeplearning:trainintervl-of bash

Running model inference...

Current progress: processed 400/2000 valid samples
Accuracy at IoU > 0.3: 1.0000
Point-in-box accuracy: 1.0000
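The two metrics in this log can be computed along the following lines. This is a sketch under assumptions, not the evaluation script used here: it assumes "IoU > 0.3" compares the predicted box with the ground-truth box, and that "point in box" checks whether the query point falls inside the predicted box. All function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def point_in_box(point, box):
    """True if (x, y) lies inside or on the edge of the box."""
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]


def evaluate(samples, iou_threshold=0.3):
    """samples: list of (point, predicted_box, ground_truth_box) tuples."""
    n = len(samples)
    if n == 0:
        return {"iou_acc": 0.0, "point_acc": 0.0}
    iou_hits = sum(iou(pred, gt) > iou_threshold for _, pred, gt in samples)
    point_hits = sum(point_in_box(pt, pred) for pt, pred, _ in samples)
    return {"iou_acc": iou_hits / n, "point_acc": point_hits / n}
```

A prediction that exactly reproduces the ground-truth box scores an IoU of 1.0, so a model that has memorized the training boxes would report 1.0000 on both metrics.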

The newly trained model appears to perform well on these samples.

Accessing the model through the API:

python x18_request_ui_point_query.py "点[56,257]所处位置的信息是什么?" /ssd/xiedong/qwenvl_train_ui_ground_datasets/img_small_size/didichuxing-20240914171548.jpg

Sending request with image file: /ssd/xiedong/qwenvl_train_ui_ground_datasets/img_small_size/didichuxing-20240914171548.jpg
Request successful. Status code: 200

API Response:
{
  "box": [33, 239, 66, 262],
  "code": 0,
  "element_description": "文本-地址",
  "imgurl": null,
  "point": [56, 257],
  "response": "文本-地址[[33, 239, 66, 262]]",
  "text": "succeed"
}

Training sample:

{"id": 0, "image": "/img_datasets/img_small_size/didichuxing-20240914171548.jpg", "width": 447, "height": 1000, "conversations": [{"from": "human", "value": "<image>点[56,257]所处位置的信息是什么?"}, {"from": "gpt", "value": "<ref>文本-地址</ref><box>[[33, 239, 66, 262]]</box>"}]}

Expected model output: "<ref>文本-地址</ref><box>[[33, 239, 66, 262]]</box>"

<ref> and <box> are special tags, and they are stripped from the text the model returns.

The actual model response is: "response": "文本-地址[[33, 239, 66, 262]]"
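Because the tags are gone, the description and the box have to be split out of the raw string. The sketch below shows one way to do that; `parse_response` is a hypothetical helper, and the output keys simply mirror the `element_description` and `box` fields of the API response shown earlier. How the actual server does this is not shown in the article.

```python
import json
import re

# Free text followed by a trailing [[x1, y1, x2, y2], ...] list.
RESPONSE_RE = re.compile(r"^(?P<label>.*?)(?P<boxes>\[\[.*\]\])\s*$", re.S)


def parse_response(response):
    """Split 'label[[x1, y1, x2, y2]]' into a description and the first box.

    Returns None if the string does not end with a parseable box list.
    """
    match = RESPONSE_RE.match(response)
    if match is None:
        return None
    try:
        boxes = json.loads(match.group("boxes"))
    except json.JSONDecodeError:
        return None
    if not boxes:
        return None
    return {
        "element_description": match.group("label").strip(),
        "box": boxes[0],
    }
```

Returning `None` for malformed output lets the caller decide whether to retry or report an error instead of crashing on a response that lacks a box.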


Author: Dong


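Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC (Creative Commons Attribution-NonCommercial 4.0 International). You may freely repost and adapt them for non-commercial purposes, provided you credit the source and link to the original author.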