## Basic Pull

Pull the latest LMDeploy image directly from Docker Hub:

```bash
docker pull openmmlab/lmdeploy:latest
```

This is equivalent to spelling out the full address of the official Docker Hub registry:

```bash
docker pull docker.io/openmmlab/lmdeploy:latest
```
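The equivalence of the two commands above comes from how Docker fills in missing parts of an image reference. The following is a minimal Python sketch of that expansion, assuming the usual rule that the first path component counts as a registry only if it contains a `.` or `:` or is `localhost`. `normalize_image` is a hypothetical helper name, not part of any Docker tooling, and the logic is simplified.

```python
def normalize_image(ref: str) -> str:
    """Expand a short image reference roughly the way the Docker CLI resolves it."""
    # Look at the first path component to decide whether it names a registry.
    first, sep, _ = ref.partition("/")
    has_registry = bool(sep) and ("." in first or ":" in first or first == "localhost")
    if not has_registry:
        # No registry given, so Docker Hub is implied.
        # Single-name images such as "ubuntu" live under the "library" namespace.
        ref = f"docker.io/{ref}" if sep else f"docker.io/library/{ref}"
    # Add the default tag when the last component has neither a tag nor a digest.
    last = ref.rsplit("/", 1)[-1]
    if ":" not in last and "@" not in last:
        ref += ":latest"
    return ref


if __name__ == "__main__":
    print(normalize_image("openmmlab/lmdeploy:latest"))
    print(normalize_image("openmmlab/lmdeploy"))
```

Both calls print the same fully qualified name, which is why the short and long `docker pull` forms fetch the same image.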



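## Registry Mirrors for Faster Pulls

Users in mainland China are advised to use a registry mirror to speed up downloads. Two common approaches follow:

### Method 1: Configure mirrors globally (recommended for long-term use)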

1. Edit or create the configuration file:
   ```bash
   sudo vim /etc/docker/daemon.json
   ```

2. Add the following content (several mirrors can be listed at once):
   ```json
   {
     "registry-mirrors": [
       "https://dockerproxy.com",
       "https://docker.mirrors.ustc.edu.cn"
     ]
   }
   ```

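   Alternatively, write the file in one step (note that this overwrites any existing `daemon.json`):
   ```bash
   # Configure Docker to use the NetEase, Baidu Cloud, and Tencent Cloud mirrors
   sudo mkdir -p /etc/docker
   cat <<EOF | sudo tee /etc/docker/daemon.json
   {
     "registry-mirrors": [
       "http://hub-mirror.c.163.com",
       "https://mirror.baidu.com/docker-ce",
       "https://mirror.ccs.tencentyun.com"
     ]
   }
   EOF
   ```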


3. Apply the configuration and restart the service:
   ```bash
   sudo systemctl daemon-reload
   sudo systemctl restart docker
   ```
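Writing `daemon.json` from scratch discards any settings already in the file. As an alternative, here is a minimal Python sketch that merges mirrors into an existing file and keeps the other keys. `merge_mirrors` is a hypothetical helper, and the demo writes to a temporary stand-in file rather than the real `/etc/docker/daemon.json`, which would require root.

```python
import json
import tempfile
from pathlib import Path


def merge_mirrors(config_path, mirrors):
    """Add mirrors to "registry-mirrors" in a daemon.json-style file, keeping other keys."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    existing = config.get("registry-mirrors", [])
    # Append only mirrors that are not already listed, preserving order.
    config["registry-mirrors"] = existing + [m for m in mirrors if m not in existing]
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "daemon.json"  # stand-in path for illustration only
        demo.write_text('{"log-driver": "json-file"}', encoding="utf-8")
        print(merge_mirrors(demo, ["https://dockerproxy.com"]))
```

The Docker daemon still has to be restarted after the real file changes.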

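> 📝 To verify that the configuration took effect, run `docker info | grep -A 3 "Registry Mirrors"` (`-A 3` also prints the mirror URLs listed on the lines after the heading).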

### Method 2: Per-pull mirror (recommended for one-off use)

Prefix the image name with the mirror host:

```bash
# Pull through dockerproxy
docker pull dockerproxy.com/openmmlab/lmdeploy:latest

# Pull through dockerpull
docker pull dockerpull.org/openmmlab/lmdeploy:latest
```
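An image pulled this way is stored locally under its mirror-prefixed name. A minimal Python sketch that builds the pull command plus an optional retag step is shown below. The retag and cleanup steps are an addition for illustration, not something the original post does, and `mirrored_pull_commands` is a hypothetical helper name.

```python
def mirrored_pull_commands(image: str, mirror: str) -> list[str]:
    """Return shell commands that pull image through mirror and restore its original name."""
    mirrored = f"{mirror}/{image}"
    return [
        f"docker pull {mirrored}",
        f"docker tag {mirrored} {image}",  # give the local image its original name
        f"docker rmi {mirrored}",          # remove only the mirror-prefixed tag
    ]


if __name__ == "__main__":
    for command in mirrored_pull_commands("openmmlab/lmdeploy:latest", "dockerproxy.com"):
        print(command)
```

After retagging, later commands can refer to `openmmlab/lmdeploy:latest` as if it had been pulled from Docker Hub.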

## Common Mirror List

https://www.wangdu.site/course/2109.html

Speeding Up Docker Pull with Mirrors: Pull Images Without Outside Help

```python
import json
import random
from concurrent.futures import ProcessPoolExecutor, as_completed
from tqdm import tqdm
import requests

def get_api_response(query, headers, url, model, prompt_guanfang, style_zhongwen):
    """
    Send one POST request to the API and return the parsed JSON response.
    """
    payload = {
        "query": query,
        "model": model,
        "prompt_guanfang": prompt_guanfang,
        "style_zhongwen": style_zhongwen
    }
    try:
        response = requests.post(url, json=payload, headers=headers)
        return response.json()  # Assumes the API returns JSON
    except (requests.exceptions.RequestException, ValueError) as e:
        # ValueError covers a response body that is not valid JSON
        return {"error": str(e)}

def worker(args):
    """
    Worker function executed in each process.
    """
    query, url, headers, model, prompt_guanfang, style_zhongwen = args
    return get_api_response(query, headers, url, model, prompt_guanfang, style_zhongwen)

if __name__ == '__main__':
    # All available API ports
    ports = list(range(8001, 8009))  # 8001 through 8008
    urls = [f"http://127.0.0.1:{port}/v1/chat/completions" for port in ports]

    headers = {
        "Content-Type": "application/json",
    }

    model = "gpt-4o-mini"  # Adjust as needed

    # Load the style definitions (风格翻译.json) and the raw user queries (用户原始输入.json)
    with open('风格翻译.json', 'r', encoding='utf-8') as json_file:
        style_dict = json.load(json_file)

    with open('用户原始输入.json', 'r', encoding='utf-8') as json_file:
        user_in = json.load(json_file)

    # Return the list of all query strings
    def load_queries(user_in):
        return [entry["query"] for entry in user_in]

    queries = load_queries(user_in)
    # Randomly pick up to 200 queries; random.sample raises ValueError if asked for more than exist
    queries = random.sample(queries, min(200, len(queries)))

    # List that collects the responses
    responses = []

    # Set up the process pool
    num_processes = len(urls)  # One process per available URL
    with ProcessPoolExecutor(max_workers=num_processes) as executor:
        # Build the task list: one task per (query, style) pair, with URLs assigned round-robin
        tasks = []
        idx_url = 0
        for idx, query in enumerate(queries):
            for key in style_dict:
                prompt_guanfang = style_dict[key]["prompt_guanfang"]
                style_zhongwen = style_dict[key]["style_zhongwen"]

                url = urls[idx_url % len(urls)]  # Round-robin URL assignment
                idx_url += 1
                tasks.append((query, url, headers, model, prompt_guanfang, style_zhongwen))

        # Show progress with tqdm; results arrive in completion order, not submission order
        futures = {executor.submit(worker, task): task for task in tasks}
        for future in tqdm(as_completed(futures), total=len(futures), desc="Processing queries"):
            response = future.result()
            responses.append(response)

    # Save all responses to a JSON file
    with open('result.json', 'w', encoding='utf-8') as json_file:
        json.dump(responses, json_file, ensure_ascii=False, indent=4)

    print("All API requests finished; results saved to result.json")

```

Speeding up API requests with Python multiprocessing
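The script above fans requests out over a process pool and collects results in completion order. Since the work is mostly waiting on HTTP responses, a thread pool is often sufficient. The following is a minimal sketch of that variant, with the same round-robin endpoint assignment and results kept in task order. `build_tasks`, `run_all`, and `fake_send` are hypothetical names, and `fake_send` stands in for the real HTTP call so the sketch runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle, product


def build_tasks(queries, styles, urls):
    """Pair every query with every style and assign endpoints round-robin."""
    url_cycle = cycle(urls)
    return [(query, style, next(url_cycle)) for query, style in product(queries, styles)]


def run_all(tasks, send, max_workers=8):
    """Run send(query, style, url) for each task; results keep the order of tasks."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda task: send(*task), tasks))


if __name__ == "__main__":
    def fake_send(query, style, url):
        # Stand-in for requests.post(...); returns what it was given.
        return {"query": query, "style": style, "url": url}

    urls = [f"http://127.0.0.1:{port}/v1/chat/completions" for port in (8001, 8002)]
    tasks = build_tasks(["fox", "panda"], ["marker", "graffiti"], urls)
    for result in run_all(tasks, fake_send):
        print(result)
```

Because `Executor.map` returns results in input order, each response can be matched back to its query and style without extra bookkeeping.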

Run:

```bash
docker run --runtime nvidia --gpus all \
    -v /data/xiedong/Qwen2.5-72B-Instruct-GPTQ-Int4:/data/xiedong/Qwen2.5-72B-Instruct-GPTQ-Int4 \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model /data/xiedong/Qwen2.5-72B-Instruct-GPTQ-Int4
```

Run in the background (detached), here on GPU 7 only:

```bash
docker run -d --runtime nvidia --gpus device=7 \
    -v /data/xiedong/Qwen2.5-72B-Instruct-GPTQ-Int4:/data/xiedong/Qwen2.5-72B-Instruct-GPTQ-Int4 \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model /data/xiedong/Qwen2.5-72B-Instruct-GPTQ-Int4
```
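A detached container returns immediately, while loading a 72B model takes some time, so requests sent too early will fail. Below is a minimal Python sketch that polls until the server answers. It assumes the server exposes the OpenAI-style `/v1/models` route on the published port; `http_ok` and `wait_until_ready` are hypothetical helper names, and the timeout values are arbitrary.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(check, timeout=600.0, interval=5.0):
    """Call check() repeatedly until it returns True or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example (assumed address; the container above publishes port 8000):
# ready = wait_until_ready(lambda: http_ok("http://127.0.0.1:8000/v1/models"))
```

Passing the check as a callable keeps the polling loop independent of the endpoint, so the same loop works for any health route.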




Call the API:


```python
import time

import requests

# Request URL
url = "http://101.136.8.66:8000/v1/chat/completions"

# Request body
payload = {
    "model": "/data/xiedong/Qwen2.5-72B-Instruct-GPTQ-Int4",  # Replace with your actual model name
    "messages": [
        {
            'role': 'user',
            'content': [
                {
                    'type': 'text',
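                    'text': '''# Role: Prompt Expander for General Image Generation
## Profile
- Author: LangGPT
- Version: 1.0
- Language: English
- Description: You are a specialized prompt expander. You turn short user-provided prompts into detailed English descriptions that meet specific style and content requirements, for use in image generation.

## Skills
1. Expand short prompts into detailed, vivid descriptions.
2. Weave in sensory details and context to enrich the prompt.
3. Keep the prompt clear and coherent while adding depth to the original idea.
4. Follow any constraints or themes the user specifies.
5. The "style formula reference string" is a style prompt string you can draw on. In general, find a way to blend the "style formula reference string" with the "user input prompt".

## Rules
1. The expanded prompt must be in English.
2. Make sure the expansion stays faithful to the core idea of the user's original prompt.
3. Avoid introducing disallowed content or drifting from the original theme.
4. The expanded prompt should be concise, no more than 100 words.
5. Do not include personal opinions or external references unrelated to the prompt.

## Workflow
1. Read the user's prompt and the "style formula reference string" carefully.
2. Identify the key elements and theme of the prompt.
3. Expand the prompt by adding relevant details, description, and context.
4. Combine the expanded prompt with the "style formula reference string".
5. Review the final prompt to make sure it meets all requirements and stays coherent.
6. Output the final expanded prompt for image generation.

## Example
"User input prompt": little fox, marker style
Marker "style formula reference string": Marker Drawing, {prompt}, bold marker lines, visible paper texture, marker drawing
Your output:
Marker Drawing, a small fox illustrated in bold marker lines, with a charmingly mischievous expression. The fox has soft, fluffy fur and is captured mid-pose, showcasing its lively, curious nature. The scene is detailed with visible paper texture, emphasizing the authentic marker drawing effect, and the vibrant colors bring a whimsical and friendly atmosphere to the illustration.

## Start
"User input prompt": giant panda, graffiti art.
Graffiti art "style formula reference string": 'Graffiti Art Style, {prompt}, dynamic, dramatic, vibrant colors, graffiti art style'.
Please output the final expanded prompt.''',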
                }
            ],
        },
        {
            'role': 'assistant',
            'content': [
                {
                    'type': 'text',
                    'text': """Graffiti Art Style, a large panda depicted in a vibrant and playful graffiti style. The panda has bold, dramatic black and white fur, with exaggerated, expressive eyes and a gentle smile. The background bursts with lively splashes of green bamboo leaves and dynamic shapes, adding a sense of movement and urban flair. This illustration is detailed with vibrant colors and strong line work, capturing the energetic essence of graffiti art while emphasizing the panda’s gentle, iconic look.""",
                }
            ],
        },
        {
            'role': 'user',
            'content': [
                {
                    'type': 'text',
                    'text': '''"User input prompt": puppy.
Graffiti art "style formula reference string": 'Graffiti art'.
Please output the final expanded prompt.'''
                }
            ],
        },
    ],

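    # "do_sample": True,  # If False, decoding is greedy, so the output is more deterministic.
    "temperature": 0.99,  # Higher values make the output more random (the OpenAI API accepts 0 to 2)
    "top_p": 0.99,  # Lower values concentrate sampling on the most probable tokens
    # "n": 3,  # Number of completions to return; 1 returns a single one, higher values return several to choose from.
    "max_tokens": 2048,
    "stream": False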
}

# Request headers
headers = {
    "Content-Type": "application/json"
}

total_time = 0

for i in range(10):
    start_time = time.time()  # Record the start time
    response = requests.post(url, json=payload, headers=headers)  # Send the POST request
    end_time = time.time()  # Record the end time

    elapsed_time = end_time - start_time  # Time taken by this request
    total_time += elapsed_time  # Accumulate the total

    if response.status_code == 200:
        result = response.json()
        content = result.get("choices")[0].get("message").get("content")
        print(f"Request {i + 1} succeeded, content: {content}")
    else:
        print(f"Request {i + 1} failed, status code: {response.status_code}, response: {response.text}")

average_time = total_time / 10  # Average time per request
print("Average time: {:.2f} s".format(average_time))
```

Average time: 3.17 s
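A single average can hide slow outliers. As a small extension of the timing loop above, this Python sketch records each latency separately and reports the mean, median, minimum, and maximum. `time_calls` and `summarize` are hypothetical helper names, and `time.sleep` stands in for the real request so the sketch runs offline.

```python
import statistics
import time


def time_calls(fn, runs=10):
    """Call fn() runs times and return the per-call latencies in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Return the mean, median, min, and max of a list of latencies."""
    return {
        "mean": statistics.mean(latencies),
        "median": statistics.median(latencies),
        "min": min(latencies),
        "max": max(latencies),
    }


if __name__ == "__main__":
    stats = summarize(time_calls(lambda: time.sleep(0.01), runs=5))
    print({name: round(value, 3) for name, value in stats.items()})
```

To time the real endpoint, pass a function that performs the `requests.post` call in place of the `time.sleep` stand-in.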

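Quick deployment of Qwen2.5 with the official vLLM Docker image

Quick deployment of Qwen2.5 with Docker and vLLM

Quick deployment of Meta-Llama-3.1-70B-Instruct with Docker, LLaMA-Factory, and vLLM

Deploying an OpenAI-format Meta-Llama-3.1-70B-Instruct API with Docker, with a vLLM speed comparison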

NanoFlow paper:
https://arxiv.org/abs/2408.12757

There is a discussion comparing Qwen2.5 with Llama 3.1 70b (link and scores follow), so let's try deploying Llama 3.1 70b with NanoFlow.

https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/discussions/1

| Model         | Score | Notes                                                          |
|---------------|-------|----------------------------------------------------------------|
| Qwen2 72b     | 73.9  | Better than some smaller models, but still below Llama 3.1 70b |
| Llama 3.1 70b | 77.9  | Highest score; performs well                                   |
| Llama 3.1 8b  | 62-64 | Lower score                                                    |
| Gemma 2 9b    | 62-64 | Lower score                                                    |
| Qwen2.5 72b   | ~50   | Similar to Qwen2 7b; knowledge level drops significantly       |
| Qwen2 7b      | ~50   | Knowledge level drops significantly                            |
| Gemma 2 2b    | 50+   | Does better on popular knowledge                               |

Start the NanoFlow container (the image is pulled on first run):

```bash
mkdir -p framework-test

docker run --gpus all --net=host --privileged -v /dev/shm:/dev/shm --name nanoflow -v ./framework-test:/code -it nvcr.io/nvidia/nvhpc:23.11-devel-cuda_multi-ubuntu22.04
```

Clone the NanoFlow code and set up the environment:

```bash
git clone https://github.com/efeslab/Nanoflow.git
cd Nanoflow
chmod +x ./installAnaconda.sh
./installAnaconda.sh
# restart the terminal

cd /root/anaconda3

./bin/conda init

. ~/.bashrc
```

Then run the NanoFlow setup script:
```bash
./setup.sh
```

The build is heavy and saturates every CPU core:


![](https://www.dong-blog.fun/static/img/819775a5cf1880feb9510441bef95c21.image.webp)


The test failed:


![](https://www.dong-blog.fun/static/img/538b210fa50fcf02f72db812152b8444.image.webp)


It simply would not run:


![](https://www.dong-blog.fun/static/img/8d1636c58fc223c48f47cf738c6618d7.image.webp)

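Testing NanoFlow: Faster than vLLM and TensorRT-LLM

Quick deployment of Qwen2.5 models with Docker and LLaMA-Factory

Sometimes a version mismatch means nothing works however you tweak it. In that case, call the API directly with `requests`: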





```python
import requests

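# Request URL
url = "http://101.136.8.66:8001/v1/chat/completions"

# Request body
payload = {
    "model": "your_model_name",  # Replace with your actual model name
    "messages": [
        {
            "role": "user",
            "content": "When someone asks who you are, just answer that you are Xiao Ming.",
        },
        {
            "role": "assistant",
            "content": "OK.",
        },
        {
            "role": "user",
            "content": "Who are you?",
        }
    ],
    "do_sample": True,  # If False, decoding is greedy, so the output is more deterministic.
    "temperature": 0.5,  # Higher values make the output more random (the OpenAI API accepts 0 to 2)
    "top_p": 0.5,  # Lower values concentrate sampling on the most probable tokens
    "n": 1,  # Number of completions to return; 1 returns a single one, higher values return several to choose from.
    "max_tokens": 2048,
    "stream": False
}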

# Request headers
headers = {
    "Content-Type": "application/json"
}

# Send the POST request
response = requests.post(url, json=payload, headers=headers)

# Print the response
if response.status_code == 200:
    result = response.json()
    # Extract the content field
    content = result.get("choices")[0].get("message").get("content")
    print("Content:", content)
else:
    print("Request failed with status code:", response.status_code)
    print("Response:", response.text)

```
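The chained `.get(...)[0]` lookup in the script raises an exception if the server returns HTTP 200 with an unexpected body, for example one without `choices`. A minimal Python sketch of a more defensive extraction follows; `extract_content` is a hypothetical helper name.

```python
def extract_content(result):
    """Return the first choice's message content, or None if the shape is unexpected."""
    try:
        return result["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError):
        return None


if __name__ == "__main__":
    ok = {"choices": [{"message": {"role": "assistant", "content": "hello"}}]}
    print(extract_content(ok))
    print(extract_content({"error": "model not found"}))
```

Returning `None` lets the caller decide whether to retry, log the raw body, or stop.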

With an API key:


```python
import requests

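# Request URL
url = "https://api.chatanywhere.tech/v1/chat/completions"

api_key = "sk-sqe93gsjT4"

# Request body
payload = {
    "model": "gpt-3.5-turbo",  # Replace with your actual model name
    "messages": [
        {
            "role": "user",
            "content": "When someone asks who you are, just answer that you are Xiao Ming.",
        },
        {
            "role": "assistant",
            "content": "OK.",
        },
        {
            "role": "user",
            "content": "Who are you?",
        }
    ],
    "do_sample": True,  # If False, decoding is greedy. Not an official OpenAI parameter, so some servers may ignore or reject it.
    "temperature": 0.5,  # Higher values make the output more random (the OpenAI API accepts 0 to 2)
    "top_p": 0.5,  # Lower values concentrate sampling on the most probable tokens
    "n": 1,  # Number of completions to return; 1 returns a single one, higher values return several to choose from.
    "max_tokens": 2048,
    "stream": False
}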

# Request headers
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}"  # Pass the API key as a Bearer token
}

# Send the POST request
response = requests.post(url, json=payload, headers=headers)

# Print the response
if response.status_code == 200:
    result = response.json()
    # Extract the content field
    content = result.get("choices")[0].get("message").get("content")
    print("Content:", content)
else:
    print("Request failed with status code:", response.status_code)
    print("Response:", response.text)

```
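Hard-coding a key in the script makes it easy to leak through version control or a shared notebook. One common alternative, sketched here in Python, reads the key from an environment variable. The variable name `OPENAI_API_KEY` and the helper `build_headers` are assumptions for illustration, not something the original script uses.

```python
import os


def build_headers(env_var="OPENAI_API_KEY", environ=None):
    """Build request headers with a Bearer token read from an environment variable."""
    environ = os.environ if environ is None else environ
    api_key = environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"environment variable {env_var} is not set")
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }


if __name__ == "__main__":
    # A plain dict stands in for the real environment so the sketch runs anywhere.
    print(build_headers(environ={"OPENAI_API_KEY": "sk-example"}))
```

The returned dictionary can be passed as `headers=` to `requests.post` exactly as in the script above.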