2024-10-11
Deep Learning

Contents

Project Overview
Environment Setup
Related Resources
Building the Docker Image
Starting the Service
Starting the Container
Starting the API Service
InternVL2-40B-AWQ Testing and Evaluation
Single-Image API Call
Average Response Time over Repeated Requests
Results
InternVL2-Llama3-76B-AWQ Testing and Evaluation
Average Response Time over Repeated Requests
Results
Summary of Test Results
InternVL2-26B-AWQ Testing and Evaluation
Average Response Time over Repeated Requests
Results

Project Overview

This project builds a multimodal API service based on the InternVL2-40B model. The service runs in a Docker environment, deploys the model with the lmdeploy library, and exposes an API that accepts image and text inputs and returns descriptive text.

Environment Setup

Related Resources

- lmdeploy: https://github.com/InternLM/lmdeploy
- InternVL2 models (AWQ quantized weights): https://huggingface.co/OpenGVLab

Building the Docker Image

Build the environment with the following Dockerfile, which configures CUDA and the required dependencies (timm plus a prebuilt flash-attention wheel matched to CUDA 12.3, PyTorch 2.3, and Python 3.10):

```dockerfile
ARG CUDA_VERSION=cu12

FROM openmmlab/lmdeploy:latest-cu12 AS cu12
ENV CUDA_VERSION_SHORT=cu123

FROM ${CUDA_VERSION} AS final
# timm is required by InternVL2's vision tower
RUN python3 -m pip install timm
# prebuilt flash-attention wheel matching the CUDA/torch/Python versions above
RUN python3 -m pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+${CUDA_VERSION_SHORT}torch2.3cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```

Build and push the image:

```bash
docker build --build-arg CUDA_VERSION=cu12 -t kevinchina/deeplearning:internvl .
docker push kevinchina/deeplearning:internvl
```

Starting the Service

Starting the Container

Start the Docker container with:

```bash
docker run --runtime nvidia --gpus all \
    -v /root/xiedong/:/root/xiedong/ \
    -p 23333:23333 \
    --ipc=host \
    -it --rm \
    kevinchina/deeplearning:internvl bash
```

Starting the API Service

Inside the container, launch the OpenAI-compatible server:

```bash
lmdeploy serve api_server /root/xiedong/InternVL2-40B-AWQ \
    --backend turbomind --server-port 23333 --model-format awq
```
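Before running any tests, it is worth confirming the server is up. A minimal sketch, using the `/v1/models` endpoint that lmdeploy's OpenAI-compatible server exposes (the URL assumes the port mapping above):

```python
import requests

# Ask the OpenAI-compatible server which models it is serving.
resp = requests.get("http://0.0.0.0:23333/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # expect the model path used above
```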

InternVL2-40B-AWQ Testing and Evaluation

Single-Image API Call

Call the API with the following Python code. The image is referenced by a path the server can read (the host directory is mounted into the container), and the model returns a descriptive text response:

```python
from openai import OpenAI

# Initialize the OpenAI client against the local lmdeploy server
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')

# Look up the served model name (optional)
# model_name = client.models.list().data[0].id
# print(model_name)

# Send the request and read the response
response = client.chat.completions.create(
    model="/root/xiedong/InternVL2-40B-AWQ",
    messages=[{
        'role': 'user',
        'content': [{
            'type': 'text',
            'text': 'Describe the image please',
        }, {
            'type': 'image_url',
            'image_url': {
                'url': '/root/xiedong/Qwen2-VL-72B-Instruct-GPTQ-Int4/demo1024.jpeg',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8,
)

# Print the returned text content
print(response.choices[0].message.content)
```
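The example above passes a filesystem path, which only works because client and server share the mounted directory. For a remote client, the image can instead be inlined as a base64 data URL, a form OpenAI-style endpoints (including lmdeploy's) generally accept. A minimal sketch; `demo1024.jpeg` here stands for any file on the client:

```python
import base64

# Read a local image and wrap it as an OpenAI-style data URL.
with open("demo1024.jpeg", "rb") as f:  # any client-side image file
    b64 = base64.b64encode(f.read()).decode("utf-8")

image_part = {
    "type": "image_url",
    "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
}
# Substitute image_part for the image_url entry in the request above.
```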

Average Response Time over Repeated Requests

Each image size is requested 10 times and the average response time is computed:

```python
import time
from openai import OpenAI

# Initialize the OpenAI client against the local lmdeploy server
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')

# Test images of increasing resolution
image_paths = [
    "/root/xiedong/Qwen2-VL-72B-Instruct-GPTQ-Int4/demo256.jpeg",
    "/root/xiedong/Qwen2-VL-72B-Instruct-GPTQ-Int4/demo512.jpeg",
    "/root/xiedong/Qwen2-VL-72B-Instruct-GPTQ-Int4/demo768.jpeg",
    "/root/xiedong/Qwen2-VL-72B-Instruct-GPTQ-Int4/demo1024.jpeg",
    "/root/xiedong/Qwen2-VL-72B-Instruct-GPTQ-Int4/demo1280.jpeg",
    "/root/xiedong/Qwen2-VL-72B-Instruct-GPTQ-Int4/demo2560.jpeg",
]

# Number of requests per image
num_requests = 10

# Average response time per image
average_speeds = {}

for image_path in image_paths:
    total_time = 0
    for _ in range(num_requests):
        start_time = time.time()
        response = client.chat.completions.create(
            model="/root/xiedong/InternVL2-40B-AWQ",
            messages=[{
                'role': 'user',
                'content': [{
                    'type': 'text',
                    'text': 'Describe the image please',
                }, {
                    'type': 'image_url',
                    'image_url': {'url': image_path},
                }],
            }],
            temperature=0.8,
            top_p=0.8,
        )
        total_time += time.time() - start_time
        # Print the response for this request (optional)
        print(f"Response for {image_path}: {response.choices[0].message.content}")

    # Compute and record this image's average response time
    average_speed = total_time / num_requests
    average_speeds[image_path] = average_speed
    print(f"Average speed for {image_path}: {average_speed} seconds")

# Report the average response time for every image
for image_path, avg_speed in average_speeds.items():
    print(f"{image_path}: {avg_speed:.2f} seconds")
```
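Averages hide variance. If tail latency matters, keep the per-request samples instead of a running sum and report the spread as well; a small sketch (the sample list is illustrative and would be filled inside the timing loop):

```python
import statistics

# Per-request latencies in seconds, collected inside the timing loop.
samples = [3.31, 3.45, 3.38, 3.62, 3.40]  # illustrative values

print(f"mean  = {statistics.mean(samples):.2f}s")
print(f"stdev = {statistics.stdev(samples):.2f}s")
# 95th percentile by nearest-rank on the sorted samples
p95 = sorted(samples)[int(0.95 * (len(samples) - 1))]
print(f"p95   = {p95:.2f}s")
```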

GPU memory usage: 72396 MiB.
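For reference, memory figures like this can be captured with a quick query; one way to do it (the author's exact method is not stated) is via `nvidia-smi`:

```python
import subprocess

# Query per-GPU memory usage through nvidia-smi (requires NVIDIA driver tools).
out = subprocess.check_output([
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
], text=True)
for line in out.strip().splitlines():
    idx, used, total = (s.strip() for s in line.split(","))
    print(f"GPU {idx}: {used} MiB used / {total} MiB total")
```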

Results

The test images (all with a 3:2 aspect ratio) and their average response times:

  • demo256.jpeg (384 × 256): 3.39 seconds

  • demo512.jpeg (768 × 512): 3.32 seconds

  • demo768.jpeg (1152 × 768): 3.43 seconds

  • demo1024.jpeg (1536 × 1024): 3.65 seconds

  • demo1280.jpeg (1920 × 1280): 3.40 seconds

  • demo2560.jpeg (3840 × 2560): 3.99 seconds

The results show that response time grows only mildly with image resolution: it stays around 3.3–3.6 seconds up to 1920 × 1280 and rises to about 4 seconds at 3840 × 2560.
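One plausible explanation, assuming InternVL2's documented dynamic-tiling preprocessing (the image is resized onto a grid of 448-px tiles whose shape best matches its aspect ratio, capped at 12 tiles by default, plus a global thumbnail): every test image here has the same 3:2 aspect ratio, so they all map to a similar tile count, and the vision-token load barely changes with pixel size. A simplified sketch of that heuristic, not lmdeploy's exact code:

```python
# Rough estimate of InternVL2-style dynamic tiling (an assumption for
# illustration; the real preprocessing also weighs image area on ties).
def estimate_tiles(width, height, max_num=12):
    aspect = width / height
    grids = [(i, j) for i in range(1, max_num + 1)
             for j in range(1, max_num + 1) if i * j <= max_num]
    # pick the grid whose aspect ratio is closest to the image's
    cols, rows = min(grids, key=lambda g: abs(g[0] / g[1] - aspect))
    n = cols * rows
    return n + 1 if n > 1 else n  # +1 for the global thumbnail tile

for w, h in [(384, 256), (768, 512), (1536, 1024), (3840, 2560)]:
    print(f"{w}x{h}: ~{estimate_tiles(w, h)} tiles")  # ~7 tiles for all
```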

InternVL2-Llama3-76B-AWQ Testing and Evaluation

Starting the Container

Start the Docker container with:

```bash
docker run --runtime nvidia --gpus all \
    -v /root/xiedong/:/root/xiedong/ \
    -p 23333:23333 \
    --ipc=host \
    -it --rm \
    kevinchina/deeplearning:internvl bash
```

Starting the API Service

```bash
lmdeploy serve api_server /root/xiedong/InternVL2-Llama3-76B-AWQ \
    --backend turbomind --server-port 23333 --model-format awq
```

Average Response Time over Repeated Requests

The benchmark is the same script as in the InternVL2-40B-AWQ section, run 10 times per image size, with only the model argument changed to `/root/xiedong/InternVL2-Llama3-76B-AWQ`.
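The loops above measure single-stream latency only. To gauge throughput under load, the same request can be issued from several threads at once; a minimal sketch, with the concurrency level and request count chosen arbitrarily:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
IMAGE = "/root/xiedong/Qwen2-VL-72B-Instruct-GPTQ-Int4/demo1024.jpeg"

def one_request(_):
    # Same request shape as the benchmark above
    r = client.chat.completions.create(
        model="/root/xiedong/InternVL2-Llama3-76B-AWQ",
        messages=[{'role': 'user', 'content': [
            {'type': 'text', 'text': 'Describe the image please'},
            {'type': 'image_url', 'image_url': {'url': IMAGE}},
        ]}],
        temperature=0.8, top_p=0.8,
    )
    return r.choices[0].message.content

start = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:  # 4 concurrent streams (arbitrary)
    outputs = list(pool.map(one_request, range(8)))
print(f"8 requests at concurrency 4 took {time.time() - start:.2f}s")
```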

GPU memory usage: 76967 MiB.

Results

  • demo256.jpeg (384 × 256): 5.88 seconds
  • demo512.jpeg (768 × 512): 5.69 seconds
  • demo768.jpeg (1152 × 768): 5.80 seconds
  • demo1024.jpeg (1536 × 1024): 5.62 seconds
  • demo1280.jpeg (1920 × 1280): 5.93 seconds
  • demo2560.jpeg (3840 × 2560): 5.80 seconds

Response time is essentially flat across resolutions, at roughly 5.6–5.9 seconds.

Summary of Test Results

The table below summarizes the results for both models:

| Model | GPU Memory (MiB) | Resolution | Avg. Response Time (s) |
| --- | --- | --- | --- |
| InternVL2-40B-AWQ | 72396 | 384 × 256 | 3.39 |
| | | 768 × 512 | 3.32 |
| | | 1152 × 768 | 3.43 |
| | | 1536 × 1024 | 3.65 |
| | | 1920 × 1280 | 3.40 |
| | | 3840 × 2560 | 3.99 |
| InternVL2-Llama3-76B-AWQ | 76967 | 384 × 256 | 5.88 |
| | | 768 × 512 | 5.69 |
| | | 1152 × 768 | 5.80 |
| | | 1536 × 1024 | 5.62 |
| | | 1920 × 1280 | 5.93 |
| | | 3840 × 2560 | 5.80 |

The table shows each model's GPU memory footprint and processing time across resolutions. At a similar memory footprint, the 76B model is roughly 1.6× slower than the 40B model at every resolution.

InternVL2-26B-AWQ Testing and Evaluation

Starting the Container

Start the Docker container with:

```bash
docker run --runtime nvidia --gpus all \
    -v /root/xiedong/:/root/xiedong/ \
    -p 23333:23333 \
    --ipc=host \
    -it --rm \
    kevinchina/deeplearning:internvl bash
```

Starting the API Service

```bash
lmdeploy serve api_server /root/xiedong/InternVL2-26B-AWQ \
    --backend turbomind --server-port 23333 --model-format awq
```

Average Response Time over Repeated Requests

Again the same benchmark script is used, run 10 times per image size, with the model argument set to `/root/xiedong/InternVL2-26B-AWQ`.

GPU memory usage: 71997 MiB.

Results

  • demo256.jpeg (384 × 256): 2.71 seconds
  • demo512.jpeg (768 × 512): 2.45 seconds
  • demo768.jpeg (1152 × 768): 2.46 seconds
  • demo1024.jpeg (1536 × 1024): 2.57 seconds
  • demo1280.jpeg (1920 × 1280): 2.59 seconds
  • demo2560.jpeg (3840 × 2560): 2.55 seconds

At 71997 MiB, the 26B model is the fastest of the three, answering in roughly 2.5–2.7 seconds regardless of resolution.
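Averaged across the six resolutions, the three models compare as follows; a few lines of Python make the trade-off explicit (numbers copied from the runs above):

```python
# Measured GPU memory and mean latency (averaged over the six test images).
results = {
    "InternVL2-26B-AWQ":        {"mem_mib": 71997, "avg_s": 2.56},
    "InternVL2-40B-AWQ":        {"mem_mib": 72396, "avg_s": 3.53},
    "InternVL2-Llama3-76B-AWQ": {"mem_mib": 76967, "avg_s": 5.79},
}

base = results["InternVL2-26B-AWQ"]["avg_s"]
for name, r in results.items():
    print(f"{name}: {r['mem_mib']} MiB, {r['avg_s']:.2f}s/request "
          f"({r['avg_s'] / base:.2f}x vs 26B)")
```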

Author: Dong

License: Except where otherwise noted, posts on this blog are licensed under the Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0). You may share and adapt them for non-commercial purposes, provided you credit the author and link to the original. Please attribute when reposting!