Model:

CompVis/stable-diffusion-v1-4

Task:

Text-to-Image

Library:

Diffusers

Other:

stable-diffusion stable-diffusion-diffusers

arXiv:

arxiv:2207.12598 arxiv:2112.10752 arxiv:2103.00020 arxiv:2205.11487 arxiv:1910.09700

License:

creativeml-openrail-m

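Stable Diffusion v1-4 Model Card

Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images from any text input. For more information about how Stable Diffusion works, see 🤗's Stable Diffusion with 🧨Diffusers blog.

The Stable-Diffusion-v1-4 checkpoint was initialized with the weights of the Stable-Diffusion-v1-2 checkpoint and then fine-tuned for 225,000 steps at resolution 512x512 on "laion-aesthetics v2 5+", with the text conditioning dropped 10% of the time to improve classifier-free guidance sampling.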
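As a rough illustration of the two ideas in the paragraph above, the sketch below drops text conditioning with a fixed probability during training and combines conditional and unconditional noise predictions at sampling time. It is a minimal sketch in plain Python, not the training or sampling code used for this checkpoint: the function names are hypothetical, replacing a dropped prompt with the empty string is an assumption, and noise predictions are represented as flat lists of floats for simplicity.

```python
import random


def drop_text_conditioning(prompts, drop_prob=0.1, rng=None):
    """Replace each prompt with "" with probability drop_prob.

    Assumption: a dropped prompt is represented by the empty string, so the
    model also learns an unconditional prediction during training.
    """
    rng = rng or random.Random()
    return ["" if rng.random() < drop_prob else p for p in prompts]


def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine noise predictions: eps = eps_uncond + s * (eps_cond - eps_uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

With a guidance scale of 1.0 the result equals the conditional prediction; larger scales push the prediction further in the direction indicated by the prompt.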

The weights here are intended to be used with the 🧨 Diffusers library. If you are looking for weights to load into the CompVis Stable Diffusion codebase, come here.

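Model Details

Developed by: Robin Rombach, Patrick Esser
Model type: Diffusion-based text-to-image generation model
Language(s): English
License: The CreativeML OpenRAIL M license is an Open RAIL M license, adapted from the work that BigScience and the RAIL Initiative are jointly carrying out in the area of responsible AI licensing. See also the article about the BLOOM Open RAIL license, on which our license is based.
Model Description: This is a model that can be used to generate and modify images based on text prompts. It is a Latent Diffusion Model that uses a fixed, pretrained text encoder (CLIP ViT-L/14), as suggested in the Imagen paper.
Resources for more information: GitHub Repository, Paper.

Cite as: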

@InProceedings{Rombach_2022_CVPR,
    author    = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn},
    title     = {High-Resolution Image Synthesis With Latent Diffusion Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {10684-10695}
}

Examples

We recommend using 🤗's Diffusers library to run Stable Diffusion.

PyTorch

pip install --upgrade diffusers transformers scipy

Running the pipeline with the default PNDM scheduler:

import torch
from diffusers import StableDiffusionPipeline

model_id = "CompVis/stable-diffusion-v1-4"
device = "cuda"


pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to(device)

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]  
    
image.save("astronaut_rides_horse.png")

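Note: If you are limited by GPU memory and have less than 4GB of GPU RAM available, make sure to load the StableDiffusionPipeline in float16 precision instead of the default float32 precision, as done above. You can do so by telling diffusers to expect the weights in float16 precision:

import torch
from diffusers import StableDiffusionPipeline

model_id = "CompVis/stable-diffusion-v1-4"
device = "cuda"

# Load the weights in half precision to reduce GPU memory use
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to(device)
# Compute attention in slices, trading some speed for lower peak memory
pipe.enable_attention_slicing()

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]

image.save("astronaut_rides_horse.png")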

To swap out the noise scheduler, pass it to from_pretrained:

import torch
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler

model_id = "CompVis/stable-diffusion-v1-4"

# Use the Euler scheduler here instead
scheduler = EulerDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]  
    
image.save("astronaut_rides_horse.png")

JAX/Flax

To run Stable Diffusion on TPUs and GPUs with faster inference, you can use JAX/Flax.

Running the pipeline with the default PNDMScheduler:

import jax
import numpy as np
from flax.jax_utils import replicate
from flax.training.common_utils import shard

from diffusers import FlaxStableDiffusionPipeline

pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", revision="flax", dtype=jax.numpy.bfloat16
)

prompt = "a photo of an astronaut riding a horse on mars"

prng_seed = jax.random.PRNGKey(0)
num_inference_steps = 50

num_samples = jax.device_count()
prompt = num_samples * [prompt]
prompt_ids = pipeline.prepare_inputs(prompt)

# shard inputs and rng
params = replicate(params)
prng_seed = jax.random.split(prng_seed, num_samples)
prompt_ids = shard(prompt_ids)

images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:])))

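Note: If you are limited by TPU memory, make sure to load the FlaxStableDiffusionPipeline in bfloat16 precision instead of the default float32 precision, as done above. You can do so by telling diffusers to load the weights from the "bf16" branch.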

import jax
import numpy as np
from flax.jax_utils import replicate
from flax.training.common_utils import shard

from diffusers import FlaxStableDiffusionPipeline

pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", revision="bf16", dtype=jax.numpy.bfloat16
)

prompt = "a photo of an astronaut riding a horse on mars"

prng_seed = jax.random.PRNGKey(0)
num_inference_steps = 50

num_samples = jax.device_count()
prompt = num_samples * [prompt]
prompt_ids = pipeline.prepare_inputs(prompt)

# shard inputs and rng
params = replicate(params)
prng_seed = jax.random.split(prng_seed, num_samples)
prompt_ids = shard(prompt_ids)

images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:])))

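Uses

Direct Use

The model is intended for research purposes only. Possible research areas and tasks include:

Safe deployment of models that have the potential to generate harmful content.
Probing and understanding the limitations and biases of generative models.
Generation of artworks and use in design and other artistic processes.
Applications in educational or creative tools.
Research on generative models.

Excluded uses are described below.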

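Misuse, Malicious Use, and Out-of-Scope Use

Note: This section is taken from the DALLE-MINI model card, but applies in the same way to Stable Diffusion v1.

The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating images that people would foreseeably find disturbing, distressing, or offensive, or content that propagates historical or current stereotypes.

Out-of-Scope Use

The model was not trained to produce factual or true representations of people or events, so using it to generate such content is out of scope for the abilities of this model.

Misuse and Malicious Use

Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to: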

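Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc.
Intentionally promoting or propagating discriminatory content or harmful stereotypes.
Impersonating individuals without their consent.
Generating sexual content without the consent of the people who might see it.
Mis- and disinformation.
Representations of egregious violence and gore.
Sharing copyrighted or licensed material in violation of its terms of use.
Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use.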

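Limitations and Bias

Limitations

The model does not achieve perfect photorealism.
The model cannot render legible text.
The model does not perform well on more difficult tasks that involve compositionality, such as rendering an image corresponding to "A red cube on top of a blue sphere".
Faces and people in general may not be generated properly.
The model was trained mainly with English captions and will not work as well in other languages.
The autoencoding part of the model is lossy.
The model was trained on the large-scale dataset LAION-5B, which contains adult material, and is not fit for product use without additional safety mechanisms and considerations.
No additional measures were used to deduplicate the dataset. As a result, we observe some degree of memorization for images that are duplicated in the training data. Searching the training data may help detect memorized images.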

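Bias

While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases. Stable Diffusion v1 was trained on subsets of LAION-2B(en), which consists of images that are primarily limited to English descriptions. Texts and images from communities and cultures that use other languages are likely to be insufficiently accounted for. This affects the overall output of the model, as white and western cultures are often set as the default. In addition, the model generates content from non-English prompts significantly worse than from English-language prompts.

Safety Module

This model is intended to be used with the Safety Checker in Diffusers. The checker works by comparing model outputs against known, hard-coded NSFW concepts. The concepts are intentionally hidden to reduce the likelihood of reverse-engineering the filter. Specifically, after the images are generated, the checker compares the class probability of harmful concepts in the embedding space of the CLIPTextModel. The concepts are passed into the model together with the generated image and compared to a hand-engineered weight for each NSFW concept.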
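The comparison described above can be pictured as a similarity test against per-concept thresholds. The sketch below is only an illustration under stated assumptions and is not the Diffusers Safety Checker: the function names are hypothetical, embeddings are plain lists of floats, cosine similarity is assumed as the comparison, and the thresholds stand in for the hand-engineered per-concept weights.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def is_flagged(image_embedding, concept_embeddings, concept_thresholds):
    """Return True if the image is closer to any concept than its threshold.

    Assumption: each concept has its own hand-tuned threshold, playing the
    role of the per-concept weight mentioned in the text.
    """
    for concept, threshold in zip(concept_embeddings, concept_thresholds):
        if cosine_similarity(image_embedding, concept) > threshold:
            return True
    return False
```

For example, `is_flagged([1.0, 0.0], [[0.0, 1.0]], [0.5])` is `False`, because the image embedding is orthogonal to the only concept.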

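Training

Training Data: The model developers used the following dataset to train the model:

LAION-2B (en) and subsets thereof (see next section)

Training Procedure: Stable Diffusion v1-4 is a latent diffusion model that combines an autoencoder with a diffusion model trained in the latent space of the autoencoder. During training:

Images are encoded through an encoder, which turns them into latent representations. The autoencoder uses a relative downsampling factor of f = 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4.
Text prompts are encoded through a ViT-L/14 text encoder.
The non-pooled output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet.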
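The shape mapping and the loss in the steps above can be sketched in a few lines. This is a minimal sketch with hypothetical function names: the mean squared error is assumed as the concrete form of the reconstruction objective, and tensors are represented as flat lists of floats rather than real arrays.

```python
def latent_shape(height, width, f=8, latent_channels=4):
    """Map an H x W x 3 image shape to the H/f x W/f x 4 latent shape."""
    if height % f or width % f:
        raise ValueError("height and width must be divisible by the downsampling factor")
    return (height // f, width // f, latent_channels)


def noise_prediction_loss(added_noise, predicted_noise):
    """Mean squared error between the added noise and the UNet's prediction.

    Assumption: the reconstruction objective is a plain MSE over all elements.
    """
    if len(added_noise) != len(predicted_noise):
        raise ValueError("noise and prediction must have the same length")
    squared = [(n - p) ** 2 for n, p in zip(added_noise, predicted_noise)]
    return sum(squared) / len(squared)
```

For a 512x512 training image, `latent_shape(512, 512)` gives `(64, 64, 4)`, so the diffusion model operates on 64 times fewer spatial positions than the pixel grid.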

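We currently provide four checkpoints, which were trained as follows.

stable-diffusion-v1-1: 237,000 steps at resolution 256x256 on laion2B-en, then 194,000 steps at resolution 512x512 on laion-high-resolution (170M examples from LAION-5B with resolution >= 1024x1024).
stable-diffusion-v1-2: Resumed from stable-diffusion-v1-1. 515,000 steps at resolution 512x512 on "laion-improved-aesthetics" (a subset of laion2B-en, filtered to images with an original size >= 512x512, an estimated aesthetics score > 5.0, and an estimated watermark probability < 0.5. The watermark estimate comes from the LAION-5B metadata; the aesthetics score is estimated using an improved aesthetics estimator).
stable-diffusion-v1-3: Resumed from stable-diffusion-v1-2. 195,000 steps at resolution 512x512 on "laion-improved-aesthetics", with the text conditioning dropped 10% of the time to improve classifier-free guidance sampling.
stable-diffusion-v1-4: Resumed from stable-diffusion-v1-2. 225,000 steps at resolution 512x512 on "laion-aesthetics v2 5+", with the text conditioning dropped 10% of the time to improve classifier-free guidance sampling.
Hardware: 32 x 8 x A100 GPUs
Optimizer: AdamW
Gradient accumulations: 2
Batch: 32 x 8 x 2 x 4 = 2048
Learning rate: warmup to 0.0001 over 10,000 steps, then kept constant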
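The batch arithmetic and the learning-rate schedule listed above can be written out as follows. This is a sketch under two assumptions that the card does not state explicitly: the factors 32 x 8 x 2 x 4 are read as nodes, GPUs per node, gradient accumulation steps, and per-GPU batch size, and the warmup is taken to be linear from zero. The function names are hypothetical.

```python
def effective_batch_size(nodes=32, gpus_per_node=8, grad_accum_steps=2, per_gpu_batch=4):
    """Samples contributing to one optimizer step: 32 x 8 x 2 x 4 = 2048."""
    return nodes * gpus_per_node * grad_accum_steps * per_gpu_batch


def learning_rate(step, peak_lr=1e-4, warmup_steps=10_000):
    """Warm up to peak_lr over warmup_steps, then hold it constant.

    Assumption: the warmup ramps linearly from 0; the card only says
    "warmup to 0.0001 over 10,000 steps".
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr
```

Under these assumptions the schedule reaches its constant value of 0.0001 at step 10,000 and stays there for the rest of training.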

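Evaluation Results

Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0) and 50 PLMS sampling steps show the relative improvements of the checkpoints.

Evaluated using 50 PLMS steps and 10,000 random prompts from the COCO2017 validation set, at 512x512 resolution. Not optimized for FID scores.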

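Environmental Impact

Stable Diffusion v1 estimated emissions: based on the information below, we estimate the following CO2 emissions using the Machine Learning Impact calculator presented in Lacoste et al. (2019). The hardware, runtime, cloud provider, and compute region were used to estimate the carbon impact.

Hardware type: A100 PCIe 40GB
Hours used: 150000
Cloud provider: AWS
Compute region: US-east
Carbon emitted (power consumption x time x carbon produced based on location of power grid): 11250 kg CO2 eq.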
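The formula in the last line above can be checked against the reported totals. The sketch below divides the reported emissions by the reported hours to get the implied rate per GPU-hour, and shows the general product form. The function names are hypothetical, and the card gives neither the power draw nor the grid carbon intensity, so any values passed for those two inputs are assumptions.

```python
REPORTED_HOURS = 150_000         # "Hours used" from the card
REPORTED_EMISSIONS_KG = 11_250   # "Carbon emitted" from the card, in kg CO2 eq.


def carbon_emitted_kg(power_kw, hours, kg_co2eq_per_kwh):
    """Power consumption x time x carbon intensity of the power grid."""
    return power_kw * hours * kg_co2eq_per_kwh


def implied_kg_per_gpu_hour(total_kg=REPORTED_EMISSIONS_KG, hours=REPORTED_HOURS):
    """Emissions per GPU-hour implied by the reported totals (0.075 kg CO2 eq.)."""
    return total_kg / hours
```

As a consistency check with purely hypothetical inputs, a draw of 0.25 kW per GPU combined with a grid intensity of 0.3 kg CO2 eq. per kWh would reproduce the reported total over 150,000 hours. Neither number comes from the card.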

Citation

    @InProceedings{Rombach_2022_CVPR,
        author    = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn},
        title     = {High-Resolution Image Synthesis With Latent Diffusion Models},
        booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
        month     = {June},
        year      = {2022},
        pages     = {10684-10695}
    }

This model card was written by Robin Rombach and Patrick Esser and is based on the DALL-E Mini model card.

Author:

CompVis

Dataset size:

15.32 GB