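StableDiffusionProcessingTxt2Img: The Text-to-Image Generation Process Explained

StableDiffusionProcessingTxt2Img is the core class that implements text-to-image generation in stable-diffusion-webui. The generation process is a multi-stage pipeline, and the sections below walk through it step by step.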

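1. Pipeline overview

The whole text-to-image flow is coordinated by the process_images function and consists of the following main stages:

Parameter initialization: set the random seed, resolution, sampler, and so on
Latent noise generation: create the initial noise for the diffusion model
Sampling: denoise step by step to produce a latent-space representation
Decoding: convert the latent-space representation into an actual image
High-resolution fix (optional): if Hires. fix is enabled, upscale the first-pass result and refine it with a second sampling pass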

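2. Key parameters

StableDiffusionProcessingTxt2Img inherits from StableDiffusionProcessing and carries a number of parameters, including:

prompt: the prompt text used to generate the image
negative_prompt: describes elements that should not appear in the image
steps: number of sampling steps, which determines how finely the denoising is carried out
cfg_scale: classifier-free guidance (CFG) scale, which controls how strictly the image follows the prompt
width/height: width and height of the generated image
seed: random seed; reusing the same seed with the same settings makes results reproducible
sampler_name: name of the sampling algorithm to use
enable_hr: whether the high-resolution fix is enabled
denoising_strength: denoising strength for the high-resolution fix pass
hr_scale: upscale factor for the high-resolution fix
hr_upscaler: upscaling algorithm used for the high-resolution fix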
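
To make the parameter list concrete, here is a minimal sketch of a request payload that uses the same field names. It assumes the web UI is running with its HTTP API enabled and that the payload would be posted to the /sdapi/v1/txt2img endpoint. The prompt text, sampler name, and upscaler name are illustrative values, not recommendations, and the block only builds and serializes the payload without sending it.

```python
import json

# Illustrative values only; the field names follow the parameters listed above.
payload = {
    "prompt": "a watercolor painting of a lighthouse at dawn",
    "negative_prompt": "blurry, low quality",
    "steps": 20,
    "cfg_scale": 7.0,
    "width": 512,
    "height": 512,
    "seed": 12345,
    "sampler_name": "Euler a",
    "enable_hr": True,
    "denoising_strength": 0.5,
    "hr_scale": 2.0,
    "hr_upscaler": "Latent",
}

# Serialize to JSON, as an HTTP client would do before posting the request.
body = json.dumps(payload)
print(body)
```

Keeping the seed fixed while changing one field at a time is a simple way to see what each parameter does.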

3. The generation process in detail

3.1 Initialization

```python
def __post_init__(self):
    super().__post_init__()

    # If first-phase dimensions are given, generate at that size first and
    # keep the originally requested size as the hires upscale target.
    if self.firstphase_width != 0 or self.firstphase_height != 0:
        self.hr_upscale_to_x = self.width
        self.hr_upscale_to_y = self.height
        self.width = self.firstphase_width
        self.height = self.firstphase_height
```

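The initialization stage sets the dimensions used during generation. If first-phase dimensions are specified, the image is first generated at that smaller size and then upscaled to the originally requested size.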
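
The size swap can be tried in isolation. The sketch below uses a hypothetical SizeConfig dataclass that copies only the size-related fields; it is a stand-in for illustration, not the real class.

```python
from dataclasses import dataclass


@dataclass
class SizeConfig:
    """Hypothetical stand-in holding only the size-related fields."""
    width: int = 512
    height: int = 512
    firstphase_width: int = 0
    firstphase_height: int = 0
    hr_upscale_to_x: int = 0
    hr_upscale_to_y: int = 0

    def __post_init__(self):
        # Same swap as in the excerpt: the requested size becomes the
        # upscale target and the first-phase size becomes the working size.
        if self.firstphase_width != 0 or self.firstphase_height != 0:
            self.hr_upscale_to_x = self.width
            self.hr_upscale_to_y = self.height
            self.width = self.firstphase_width
            self.height = self.firstphase_height


cfg = SizeConfig(width=1024, height=1024, firstphase_width=512, firstphase_height=512)
print(cfg.width, cfg.height, cfg.hr_upscale_to_x, cfg.hr_upscale_to_y)
# prints: 512 512 1024 1024
```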

3.2 Prompt processing

```python
def setup_prompts(self):
    # Apply the selected styles to the positive and negative prompts
    self.all_prompts = [shared.prompt_styles.apply_styles_to_prompt(x, self.styles) for x in self.all_prompts]
    self.all_negative_prompts = [shared.prompt_styles.apply_negative_styles_to_prompt(x, self.styles) for x in self.all_negative_prompts]

    # Do the same for the prompts used by the high-resolution fix pass
    if self.enable_hr:
        self.all_hr_prompts = [shared.prompt_styles.apply_styles_to_prompt(x, self.styles) for x in self.all_hr_prompts]
        self.all_hr_negative_prompts = [shared.prompt_styles.apply_negative_styles_to_prompt(x, self.styles) for x in self.all_hr_negative_prompts]
```

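Prompt processing applies styles, parses extra-network tags (such as LoRA) and embeddings (such as Textual Inversion), and finally converts the text into conditioning vectors the model can consume. The setup_prompts excerpt above covers only the style step; the conversion to conditioning vectors happens in setup_conds.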
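
As a rough illustration of the style step, the sketch below merges a style string into each prompt. The apply_style helper is hypothetical and greatly simplified: it assumes a style either contains a {prompt} placeholder or is appended after a comma, and it ignores everything else the real style handling may do.

```python
def apply_style(prompt: str, style_text: str) -> str:
    """Simplified stand-in: fill a {prompt} placeholder, otherwise append."""
    if "{prompt}" in style_text:
        return style_text.replace("{prompt}", prompt)
    # No placeholder: join the non-empty pieces with a comma.
    parts = [p for p in (prompt.strip(), style_text.strip()) if p]
    return ", ".join(parts)


all_prompts = ["a red fox", "a snowy owl"]
style = "{prompt}, highly detailed, studio lighting"

# One styled prompt per image in the batch, as in the list comprehension above.
styled = [apply_style(p, style) for p in all_prompts]
print(styled)
```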

3.3 Setting up the conditioning vectors

```python
def setup_conds(self):
    # Conditioning for the negative prompt (the "unconditional" branch)
    self.uc = self.get_conds_with_caching(prompt_parser.get_learned_conditioning,
                                          self.negative_prompts, self.steps,
                                          [self.cached_uc, None], self.extra_network_data)

    # Conditioning for the positive prompt
    self.c = self.get_conds_with_caching(prompt_parser.get_multicond_learned_conditioning,
                                         self.prompts, self.steps,
                                         [self.cached_c, None], self.extra_network_data)
```

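Each prompt is converted into a conditioning vector (stored in self.c) that guides the diffusion model during generation. The negative prompt is converted in the same way (stored in self.uc) and tells the model what to steer away from.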
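
How the two conditionings interact can be shown with the standard classifier-free guidance combination. The sketch below works on plain Python lists rather than tensors; cfg_combine and the sample values are illustrative, and the web UI's own implementation is batched and handles more cases.

```python
def cfg_combine(cond_pred, uncond_pred, cfg_scale):
    """Classifier-free guidance: move from the uncond prediction toward the cond one."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond_pred, uncond_pred)]


# Made-up model outputs for the positive and negative conditioning.
cond_pred = [1.0, 0.5, -0.25]
uncond_pred = [0.5, 0.5, 0.25]

print(cfg_combine(cond_pred, uncond_pred, 1.0))  # scale 1 reproduces cond_pred
print(cfg_combine(cond_pred, uncond_pred, 7.0))  # larger scale exaggerates the difference
```

With a scale of 1 the result equals the conditional prediction; larger values push the result further away from whatever the negative prompt describes.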

3.4 Sampling

```python
def sample(self, conditioning, unconditional_conditioning, seeds, subseeds, subseed_strength, prompts):
    # Create the sampler
    self.sampler = sd_samplers.create_sampler(self.sampler_name, self.sd_model)

    # Draw the initial random noise
    x = self.rng.next()

    # Run the sampling (denoising) loop
    samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning,
                                  image_conditioning=self.txt2img_image_conditioning(x))

    # Without the high-resolution fix, return the result directly
    if not self.enable_hr:
        return samples

    # Otherwise continue with the high-resolution fix pass...
```

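Sampling is the core of the whole generation process. Its main steps are:

Create the sampler: build the sampler object that matches the selected sampler name
Generate the initial noise: derive the starting noise from the random seed
Denoise step by step: guided by the conditioning vectors, gradually turn the noise into a meaningful latent-space representation

Samplers such as DPM++ 2M and Euler a use different numerical methods for the denoising process, which affects the quality and style of the final image.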
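
The step-by-step denoising loop can be sketched with a plain Euler integrator. Everything here is a toy under stated assumptions: toy_denoiser is a hypothetical stand-in for the diffusion model that always predicts the same clean sample, the sigma schedule is hand-picked, and the "latent" is a three-element list.

```python
import random


def toy_denoiser(x, sigma, target):
    """Hypothetical denoiser: always predicts the same clean sample."""
    return list(target)


def euler_sample(x, sigmas, target):
    """Walk the noise level down from sigmas[0] to sigmas[-1] with Euler steps."""
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = toy_denoiser(x, sigma, target)
        # Direction pointing from the denoised estimate toward the noisy sample
        d = [(xi - di) / sigma for xi, di in zip(x, denoised)]
        # Step to the next (lower) noise level
        x = [xi + di * (sigma_next - sigma) for xi, di in zip(x, d)]
    return x


rng = random.Random(0)                 # fixed seed for reproducibility
sigmas = [10.0, 5.0, 2.0, 1.0, 0.0]    # decreasing noise levels, ending at zero
target = [0.2, -0.7, 0.4]

# Start from unit Gaussian noise scaled to the largest sigma.
x = [rng.gauss(0.0, 1.0) * sigmas[0] for _ in target]
result = euler_sample(x, sigmas, target)
print(result)
```

Because the toy denoiser is constant, the loop lands on the target once sigma reaches zero; a real model's prediction changes at every step, which is why the step count and the choice of integrator matter.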

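3.5 The high-resolution fix pass (Hires.fix)

```python
def sample_hr_pass(self, samples, decoded_samples, seeds, subseeds, subseed_strength, prompts):
    # Target resolution
    target_width = self.hr_upscale_to_x
    target_height = self.hr_upscale_to_y

    # Upscale either in latent space or in image space
    if self.latent_scale_mode is not None:
        # Latent-space upscaling
        samples = torch.nn.functional.interpolate(samples,
                                                  size=(target_height // opt_f, target_width // opt_f),
                                                  mode=self.latent_scale_mode["mode"],
                                                  antialias=self.latent_scale_mode["antialias"])
    else:
        # Image-space upscaling: map the decoded images from [-1, 1] to [0, 1]
        lowres_samples = torch.clamp((decoded_samples + 1.0) / 2.0, min=0.0, max=1.0)

        batch_images = []
        for x_sample in lowres_samples:
            # CHW float tensor -> HWC uint8 array that PIL can read
            x_sample = 255.0 * np.moveaxis(x_sample.cpu().numpy(), 0, 2)
            image = Image.fromarray(x_sample.astype(np.uint8))
            image = images.resize_image(0, image, target_width, target_height, upscaler_name=self.hr_upscaler)
            # Back to a CHW float array in [0, 1]
            image = np.moveaxis(np.array(image).astype(np.float32) / 255.0, 2, 0)
            batch_images.append(image)
        decoded_samples = torch.from_numpy(np.array(batch_images))
        # Encode the upscaled images back into latent space
        samples = images_tensor_to_samples(decoded_samples)

    # Second sampling pass (the high-resolution fix itself)
    self.rng = rng.ImageRNG(samples.shape[1:], self.seeds)
    noise = self.rng.next()
    samples = self.sampler.sample_img2img(self, samples, noise, self.hr_c, self.hr_uc,
                                          steps=self.hr_second_pass_steps or self.steps)

    return samples
```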

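The high-resolution fix pass (when enable_hr=True) consists of:

Upscaling: enlarge the low-resolution result of the first pass to the target size
- Either in latent space (usually faster, at some cost in quality)
- Or in image space (slower, but usually with better quality)
Second sampling pass: denoise the upscaled result a second time in img2img mode
- denoising_strength controls how much of the original content is preserved
- A different sampler and step count can be used

This makes it possible to generate images at resolutions well above the model's native training resolution while keeping detail quality.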
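
The size arithmetic behind the upscaling step can be sketched as follows. This assumes the target size comes purely from hr_scale and that the latent is 8 times smaller than the image along each axis (the opt_f factor in the excerpt); hires_target is a hypothetical helper, and explicit resize targets and cropping are ignored.

```python
OPT_F = 8  # assumed ratio between pixel size and latent size along each axis


def hires_target(width, height, hr_scale):
    """Return the (pixel, latent) target sizes for a given first-pass size."""
    target_w = int(width * hr_scale)
    target_h = int(height * hr_scale)
    return (target_w, target_h), (target_w // OPT_F, target_h // OPT_F)


pixels, latent = hires_target(512, 768, 2.0)
print(pixels, latent)
# prints: (1024, 1536) (128, 192)
```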

3.6 Decoding and post-processing

```python
# Excerpt from the process_images_inner function
x_samples_ddim = decode_latent_batch(p.sd_model, samples_ddim, target_device=devices.cpu)
x_samples_ddim = torch.stack(x_samples_ddim).float()
x_samples_ddim = torch.clamp((x_samples_ddim + 1.0) / 2.0, min=0.0, max=1.0)
```

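The final step turns the latent-space representation into an actual image:

Decoding: the VAE decoder converts the latent-space representation into an RGB image
Normalization: pixel values are mapped from the [-1, 1] range to the [0, 1] range
Post-processing: may include face restoration, saving metadata, and so on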
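
The normalization step is simple enough to check by hand. The sketch below applies the same clamp-and-rescale to a single value and then converts it to an 8-bit channel value; to_uint8 is a hypothetical helper, and truncating (rather than rounding) the result is an assumption of this sketch.

```python
def to_uint8(value):
    """Map a decoder output in [-1, 1] to an integer channel value in [0, 255]."""
    # Same formula as the torch.clamp line above, applied to one number
    scaled = min(max((value + 1.0) / 2.0, 0.0), 1.0)
    # Truncate toward zero when converting to an integer
    return int(255.0 * scaled)


print([to_uint8(v) for v in (-1.0, 0.0, 1.0, 2.0)])
# prints: [0, 127, 255, 255]
```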

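4. Sampler implementation

The sampler is a key part of the generation process and directly affects image quality and generation behavior. Take KDiffusionSampler as an example:

```python
def sample(self, p, x, conditioning, unconditional_conditioning, steps=None, image_conditioning=None):
    steps = steps or p.steps
    sigmas = self.get_sigmas(p, steps)  # compute the noise schedule

    # Scale the unit-variance noise to the first (largest) sigma
    x = x * sigmas[0]

    # Prepare the denoising arguments
    self.model_wrap_cfg.init_latent = x
    self.sampler_extra_args = {
        'cond': conditioning,  # conditioning for the positive prompt
        'image_cond': image_conditioning,  # image conditioning, for models that expect it
        'uncond': unconditional_conditioning,  # unconditional branch (carries the negative prompt)
        'cond_scale': p.cfg_scale,  # CFG scale
    }

    # Run the sampling algorithm
    samples = self.func(self.model_wrap_cfg, x, extra_args=self.sampler_extra_args, ...)

    return samples
```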

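The sampler runs denoising algorithms such as DDIM, DPM++, and Euler. It controls the denoising process through the noise schedule (sigmas) and uses the conditioning vectors (the encoded prompts) to guide the generated content.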
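
To give a feel for what a noise schedule looks like, the sketch below computes the schedule proposed by Karras et al., one of several schedules in common use. It is not necessarily what get_sigmas returns for a given sampler; karras_sigmas is an illustrative helper, and the sigma_min and sigma_max values are made up for the example.

```python
def karras_sigmas(n, sigma_min, sigma_max, rho=7.0):
    """Karras-style schedule: n decreasing noise levels followed by a final 0."""
    min_inv = sigma_min ** (1.0 / rho)
    max_inv = sigma_max ** (1.0 / rho)
    sigmas = []
    for i in range(n):
        # ramp goes from 0 (largest sigma) to 1 (smallest sigma)
        ramp = i / (n - 1) if n > 1 else 0.0
        sigmas.append((max_inv + ramp * (min_inv - max_inv)) ** rho)
    return sigmas + [0.0]


schedule = karras_sigmas(10, sigma_min=0.1, sigma_max=10.0)
print([round(s, 3) for s in schedule])
```

The schedule starts at sigma_max, which is the value the initial noise is multiplied by, and spends most of its steps at low noise levels, where fine detail is resolved.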

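5. Summary of the overall flow

The complete text-to-image flow can be summarized as:

User input handling: parse the prompts and set the parameters
Initialization: set the random seed, select the model and sampler
Conditioning: convert the prompts into conditioning vectors
Low-resolution generation:
- Initialize random noise
- Denoise step by step
- Decode to an image
High-resolution fix (if enabled):
- Upscale to a higher resolution
- Run a second denoising pass
- Decode the final image
Post-processing: face restoration, saving metadata, and so on

Several components work together in this process, including the text encoder, the diffusion model, the VAE decoder, and the various sampling algorithms, to turn a text description into a high-quality image.

Understanding this process makes it easier to tune parameters for the result you want: for example, increasing the step count to improve quality (up to a point), adjusting the CFG scale to balance creativity against prompt adherence, or enabling the high-resolution fix to obtain higher-resolution output.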
