llama.cpp 跑 qwen2.5 量化模型

deepseekv3对Qwen-2.5-14B进行蒸馏，模型如下：

https://huggingface.co/arcee-ai/Virtuoso-Small-v2

sglang运行指令：

bash
展开代码
docker run --gpus '"device=5"' \
    --shm-size 32g \
    -d -p 7890:7890 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    -v /data/xiedong/Virtuoso-Small-v2:/data/xiedong/Qwen2.5-32B-Instruct-GPTQ-Int4 \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path /data/xiedong/Qwen2.5-32B-Instruct-GPTQ-Int4 --host 0.0.0.0 --port 7890 --tp 1 --api-key "ns34xx.."

sglang运行速度：42.27 T/s

llama.cpp 量化此模型：

https://huggingface.co/bartowski/arcee-ai_Virtuoso-Small-v2-GGUF/blob/main/arcee-ai_Virtuoso-Small-v2-Q8_0.gguf

llama.cpp 教程：

https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md

llama.cpp server执行指令：

bash
展开代码
docker pull ghcr.io/ggml-org/llama.cpp:server-cuda


docker run --gpus '"device=7"' \
    --shm-size 32g \
    -v /data/xiedong:/models -p 7893:8000 ghcr.io/ggml-org/llama.cpp:server-cuda -m /models/arcee-ai_Virtuoso-Small-v2-Q6_K_L.gguf --port 8000 --host 0.0.0.0 -n 512