Launching a Qwen2.5-32B-Instruct-GPTQ-Int4 API with sglang

Another similar tutorial of mine: https://www.dong-blog.fun/post/1942

Model: https://huggingface.co/Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4

Start the API server:

# Run on GPUs 5 and 6; --tp 2 below shards the model across both with tensor parallelism.
docker run --gpus '"device=5,6"' \
    --shm-size 32g \
    -d -p 7890:7890 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    -v /data/xiedong/Qwen2.5-32B-Instruct-GPTQ-Int4:/data/xiedong/Qwen2.5-32B-Instruct-GPTQ-Int4 \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path /data/xiedong/Qwen2.5-32B-Instruct-GPTQ-Int4 --host 0.0.0.0 --port 7890 --tp 2 --api-key "xxxx.."
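
Loading a 32B model can take a while, so it can help to confirm the server is ready before sending chat requests. The following is a minimal sketch, assuming the server exposes the OpenAI-compatible /v1/models route and accepts the --api-key value as a Bearer token. The helper names build_models_request and list_served_models are hypothetical and introduced only for illustration; the address and key are the placeholders used elsewhere in this post.

```python
import json
import urllib.request


def build_models_request(base_url, api_key):
    """Build a GET request for the model list (assumed route: <base_url>/models)."""
    url = base_url.rstrip("/") + "/models"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def list_served_models(base_url, api_key, timeout=5.0):
    """Return the model ids reported by the server; raises if it is unreachable."""
    request = build_models_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # Assumes an OpenAI-style payload: {"data": [{"id": ...}, ...]}
    return [model["id"] for model in payload.get("data", [])]


if __name__ == "__main__":
    # Placeholder address and key from this post; replace with your own.
    print(list_served_models("http://10.136.8.66:7890/v1/", "xxxx.."))
```

If the call succeeds, the returned id should be the value to pass as model in the client request.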

Client request:

from openai import OpenAI

client = OpenAI(
    api_key="xxxx..",
    base_url="http://10.136.8.66:7890/v1/"
)

chat_completion = client.chat.completions.create(
    messages=[
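        # System prompt, in English: "You are a prompt generator, skilled at rewriting
        # user-provided prompts into better-suited wording. You may only output English."
        {"role": "system",
         "content": "你是Prompt提示词生成器，擅长将用户提供的提示词优化为更合适的表达。你只能输出英文。"},
        # User prompt, in English: "Please start optimizing the prompt now.
        # The user input is: a woman, anime style"
        {"role": "user", "content": "现在请你开始优化提示词，用户输入为：一个女人，动漫风格"}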
    ],
    model="/data/xiedong/Qwen2.5-32B-Instruct-GPTQ-Int4",
)
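# Print the full response object for inspection.
print(chat_completion)

# Extract only the generated text from the first choice.
generated_text = chat_completion.choices[0].message.content
print(generated_text)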
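
For longer outputs, the same request can be streamed so that text is available as it is generated. This is a sketch under assumptions, not part of the original setup: it assumes the server honors stream=True on the OpenAI-compatible chat endpoint and that each chunk carries its text in choices[0].delta.content, as the openai client does. The names join_stream_text, fake_chunk, and stream_chat are hypothetical; fake_chunk exists only so the joining logic can be checked without a running server.

```python
from types import SimpleNamespace


def join_stream_text(chunks):
    """Concatenate the text deltas from an iterable of streamed chat chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks may carry no choices
        delta = chunk.choices[0].delta.content
        if delta:  # skip None or empty deltas (e.g. role-only or final chunks)
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in shaped like a streamed chunk, for offline checks."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


def stream_chat(prompt):
    """Send one user message with stream=True and return the joined text."""
    from openai import OpenAI  # imported here so the helpers above need no dependency

    client = OpenAI(api_key="xxxx..", base_url="http://10.136.8.66:7890/v1/")
    stream = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model="/data/xiedong/Qwen2.5-32B-Instruct-GPTQ-Int4",
        stream=True,
    )
    return join_stream_text(stream)


if __name__ == "__main__":
    print(stream_chat("一个女人，动漫风格"))
```

Keeping the joining logic separate from the network call makes it easy to swap in incremental printing or to test the parsing on its own.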