This project builds a multimodal API service around the InternVL2-40B model. The service runs in a Docker environment, deploys the model with the lmdeploy library, and exposes an API that accepts image and text input and returns descriptive text.
Build the environment with the following Dockerfile, which configures CUDA and the required dependencies.
```dockerfile
ARG CUDA_VERSION=cu12

FROM openmmlab/lmdeploy:latest-cu12 AS cu12
ENV CUDA_VERSION_SHORT=cu123

FROM ${CUDA_VERSION} AS final
RUN python3 -m pip install timm
RUN python3 -m pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+${CUDA_VERSION_SHORT}torch2.3cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```
Build commands:

```bash
docker build --build-arg CUDA_VERSION=cu12 -t kevinchina/deeplearning:internvl .
docker push kevinchina/deeplearning:internvl
```
Start the Docker container with the following command:
```bash
docker run --runtime nvidia --gpus all \
    -v /root/xiedong/:/root/xiedong/ \
    -p 23333:23333 \
    --ipc=host \
    -it --rm \
    kevinchina/deeplearning:internvl bash
```
Inside the container, start the API service:

```bash
lmdeploy serve api_server /root/xiedong/InternVL2-40B-AWQ \
    --backend turbomind \
    --server-port 23333 \
    --model-format awq
```
Call the API with the following Python code, passing an image URL and receiving descriptive text output:
```python
from openai import OpenAI

# Initialize the OpenAI-compatible client
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')

# Optionally look up the served model name:
# model_name = client.models.list().data[0].id
# print(model_name)

# Send the request and get the response
response = client.chat.completions.create(
    model="/root/xiedong/InternVL2-40B-AWQ",
    messages=[{
        'role': 'user',
        'content': [{
            'type': 'text',
            'text': 'Describe the image please',
        }, {
            'type': 'image_url',
            'image_url': {
                'url': '/root/xiedong/Qwen2-VL-72B-Instruct-GPTQ-Int4/demo1024.jpeg',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8
)

# Print the returned text
print(response.choices[0].message.content)
```
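Note that the `image_url` above is a filesystem path, which the server resolves inside the container; this only works when client and server share that filesystem. For a remote client, the usual alternative is to inline the image as a base64 data URL. The helper below is a minimal sketch (the function name is mine, and I am assuming the server accepts data URLs, which OpenAI-compatible endpoints generally do):

```python
import base64

def image_to_data_url(path: str, mime: str = "image/jpeg") -> str:
    """Read a local image file and encode it as a base64 data URL."""
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{payload}"
```

The returned string can then be passed as the `url` value in the `image_url` content part instead of a server-side path.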
Send 10 requests per image size and compute the average response time:
```python
import time
from openai import OpenAI

# Initialize the OpenAI-compatible client
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')

# Image paths to test
image_paths = [
    "/root/xiedong/Qwen2-VL-72B-Instruct-GPTQ-Int4/demo256.jpeg",
    "/root/xiedong/Qwen2-VL-72B-Instruct-GPTQ-Int4/demo512.jpeg",
    "/root/xiedong/Qwen2-VL-72B-Instruct-GPTQ-Int4/demo768.jpeg",
    "/root/xiedong/Qwen2-VL-72B-Instruct-GPTQ-Int4/demo1024.jpeg",
    "/root/xiedong/Qwen2-VL-72B-Instruct-GPTQ-Int4/demo1280.jpeg",
    "/root/xiedong/Qwen2-VL-72B-Instruct-GPTQ-Int4/demo2560.jpeg"
]

# Number of requests per image
num_requests = 10

# Average response time per image
average_speeds = {}

for image_path in image_paths:
    total_time = 0
    # Send num_requests requests for this image
    for _ in range(num_requests):
        start_time = time.time()
        # Send the request and get the response
        response = client.chat.completions.create(
            model="/root/xiedong/InternVL2-40B-AWQ",
            messages=[{
                'role': 'user',
                'content': [{
                    'type': 'text',
                    'text': 'Describe the image please',
                }, {
                    'type': 'image_url',
                    'image_url': {
                        'url': image_path,
                    },
                }],
            }],
            temperature=0.8,
            top_p=0.8
        )
        # Accumulate the elapsed time
        elapsed_time = time.time() - start_time
        total_time += elapsed_time
        # Print this request's response (optional)
        print(f"Response for {image_path}: {response.choices[0].message.content}")
    # Compute and record the average response time for this image
    average_speed = total_time / num_requests
    average_speeds[image_path] = average_speed
    print(f"Average speed for {image_path}: {average_speed} seconds")

# Print the average response time for every image
for image_path, avg_speed in average_speeds.items():
    print(f"{image_path}: {avg_speed:.2f} seconds")
```
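One caveat: the script reports the arithmetic mean, which a single slow request (for example the first one, while caches warm up) can skew. Reporting the median alongside the mean gives a more robust summary. A minimal sketch with hypothetical latency samples:

```python
import statistics

# Hypothetical per-request latencies in seconds, with one warm-up outlier
samples = [3.4, 3.3, 3.6, 3.5, 9.8]

print(f"mean={statistics.mean(samples):.2f}s")      # pulled up by the outlier
print(f"median={statistics.median(samples):.2f}s")  # robust to the outlier
```

To use this with the benchmark above, keep each request's `elapsed_time` in a list instead of only accumulating `total_time`.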
GPU memory usage: 72396 MiB.
| Image | Resolution | Avg. response time (s) |
|---|---|---|
| demo256.jpeg | 384 x 256 | 3.39 |
| demo512.jpeg | 768 x 512 | 3.32 |
| demo768.jpeg | 1152 x 768 | 3.43 |
| demo1024.jpeg | 1536 x 1024 | 3.65 |
| demo1280.jpeg | 1920 x 1280 | 3.40 |
| demo2560.jpeg | 3840 x 2560 | 3.99 |
The results show that response time grows only modestly with image resolution, and not strictly monotonically.
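To quantify that trend, the measured averages can be normalized against the smallest image. A quick sketch using the numbers from this run:

```python
# Average latencies (s) measured above, keyed by the demo image name's size suffix
times = {256: 3.39, 512: 3.32, 768: 3.43, 1024: 3.65, 1280: 3.40, 2560: 3.99}

base = times[256]
for size, t in times.items():
    print(f"demo{size}: {t:.2f}s ({(t / base - 1) * 100:+.1f}% vs demo256)")
```

Even the 3840 x 2560 image comes out under 20% slower than the smallest one, so decoding/preprocessing cost is small relative to generation time here.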
Start the container
Start the Docker container with the following command:
```bash
docker run --runtime nvidia --gpus all \
    -v /root/xiedong/:/root/xiedong/ \
    -p 23333:23333 \
    --ipc=host \
    -it --rm \
    kevinchina/deeplearning:internvl bash
```
Start the API service

```bash
lmdeploy serve api_server /root/xiedong/InternVL2-Llama3-76B-AWQ \
    --backend turbomind \
    --server-port 23333 \
    --model-format awq
```
Again send 10 requests per image size and compute the average response time. The benchmark script is identical to the one used for InternVL2-40B-AWQ above, with `model` changed to `"/root/xiedong/InternVL2-Llama3-76B-AWQ"`.
GPU memory usage: 76967 MiB.
The test results for both models are summarized below:
| Model | GPU memory (MiB) | Resolution | Avg. response time (s) |
|---|---|---|---|
| InternVL2-40B-AWQ | 72396 | 384 x 256 | 3.39 |
| | | 768 x 512 | 3.32 |
| | | 1152 x 768 | 3.43 |
| | | 1536 x 1024 | 3.65 |
| | | 1920 x 1280 | 3.40 |
| | | 3840 x 2560 | 3.99 |
| InternVL2-Llama3-76B-AWQ | 76967 | 384 x 256 | 5.88 |
| | | 768 x 512 | 5.69 |
| | | 1152 x 768 | 5.80 |
| | | 1536 x 1024 | 5.62 |
| | | 1920 x 1280 | 5.93 |
| | | 3840 x 2560 | 5.80 |
The table lists each model's GPU memory usage and its processing time at each resolution.
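From these six measured points per model, a rough overall comparison can be computed (a back-of-the-envelope summary of this run only, not a general benchmark):

```python
# Measured average latencies (s) per resolution, taken from the table above
t40 = [3.39, 3.32, 3.43, 3.65, 3.40, 3.99]  # InternVL2-40B-AWQ
t76 = [5.88, 5.69, 5.80, 5.62, 5.93, 5.80]  # InternVL2-Llama3-76B-AWQ

mean40 = sum(t40) / len(t40)
mean76 = sum(t76) / len(t76)
print(f"InternVL2-40B-AWQ mean latency: {mean40:.2f}s")
print(f"InternVL2-Llama3-76B-AWQ mean latency: {mean76:.2f}s")
print(f"76B / 40B latency ratio: {mean76 / mean40:.2f}x")
```

On these numbers the 76B model is roughly 1.6x slower than the 40B model, while its GPU memory usage (76967 vs 72396 MiB) is only about 6% higher.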
Start the container
Start the Docker container with the following command:
```bash
docker run --runtime nvidia --gpus all \
    -v /root/xiedong/:/root/xiedong/ \
    -p 23333:23333 \
    --ipc=host \
    -it --rm \
    kevinchina/deeplearning:internvl bash
```
Start the API service

```bash
lmdeploy serve api_server /root/xiedong/InternVL2-26B-AWQ \
    --backend turbomind \
    --server-port 23333 \
    --model-format awq
```
Again send 10 requests per image size and compute the average response time. The benchmark script is identical to the one used for InternVL2-40B-AWQ above, with `model` changed to `"/root/xiedong/InternVL2-26B-AWQ"`.
GPU memory usage: 71997 MiB.
Author: Dong
Link to this article:
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC (Creative Commons Attribution-NonCommercial 4.0 International). You may share and adapt them for non-commercial use, provided you credit the author and link to the original. Please cite the source when reprinting.