If you're looking for a quick and easy way to program a multi-ControlNet pipeline (several ControlNet conditions applied together) in Python, you've come to the right place! With the Hugging Face Diffusers library this is fairly easy to do. Let me show you how.

We'll start by importing some libraries. Below we import Image from Pillow, which we'll use to load and process the input image. From the diffusers library we import DPMSolverMultistepScheduler, a scheduler equivalent to DPM++ 2M in the Automatic1111 web UI, as well as the ControlNetModel class, whose instances we'll pass to the StableDiffusionControlNetPipeline.

Furthermore, we import the auxiliary detectors from controlnet_aux, a library that provides the preprocessors (annotators) used to turn the input image into conditioning images such as line art and depth maps.

from PIL import Image
from diffusers import DPMSolverMultistepScheduler, ControlNetModel, StableDiffusionControlNetPipeline
import torch
from controlnet_aux import MidasDetector, LineartDetector

import numpy as np

Next we create a resize function that scales the image down so its shortest side is at most 512 pixels and rounds both dimensions to a multiple of 64. Stable Diffusion 1.5 was trained on 512x512 images and works best on images close to this size, and the pipeline requires dimensions that are divisible by 8, so rounding to 64 keeps us safely within that constraint.

def resize_image(image: Image.Image, resolution: int = 512) -> Image.Image:
    W, H = image.size
    # Downscale so the shortest side becomes `resolution` (never upscale).
    if resolution < min(W, H):
        k = resolution / min(W, H)
        W *= k
        H *= k

    # Round both dimensions to the nearest multiple of 64.
    W_new = int(np.round(W / 64) * 64)
    H_new = int(np.round(H / 64) * 64)
    return image.resize((W_new, H_new))
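
To get a feel for what the function does, here is a quick hypothetical example: an 800x600 image gets its shorter side scaled to 512 and both dimensions rounded to multiples of 64.

example = Image.new("RGB", (800, 600))
print(resize_image(example).size)  # prints (704, 512)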

We also load the line art and MiDaS (depth) detectors that will be used to preprocess the image. Finally, we load the ControlNet models themselves. Note that in this example I'm only using two, but you can use as many as you want and in any order.

# Preprocessors (annotators) from controlnet_aux.
lineart = LineartDetector.from_pretrained("lllyasviel/Annotators")
depth = MidasDetector.from_pretrained("lllyasviel/Annotators")

# One ControlNet model per condition; keep this order consistent with the
# conditioning images passed to the pipeline later.
cn_steps = [
    ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_lineart", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16)
]

Afterwards, we set up the Stable Diffusion pipeline, passing the cn_steps list as the controlnet argument. We also swap in the DPM++ 2M scheduler, enable xformers memory-efficient attention and model CPU offloading to keep VRAM usage down, and create a seeded generator so the results are reproducible.

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=cn_steps, torch_dtype=torch.float16
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_xformers_memory_efficient_attention()  # requires the xformers package
pipe.enable_model_cpu_offload()

generator = torch.Generator(device="cpu").manual_seed(1)

With that out of the way, we can open the example image and resize it to a suitable size. We then run each preprocessor on the image and pass the resulting conditioning images as a list, in the same order as the ControlNet models in cn_steps. All that's left is to run the pipeline with a prompt and generate an image.

with open("cat.jpg", "rb") as i:
    image = Image.open("cat.jpg").convert("RGB")

image = resize_image(image)

cn_images = [
    lineart(image),
    depth(image),
]
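
# Optional sanity check (not part of the original walkthrough): save the
# conditioning images to verify the line art and depth map look reasonable.
cn_images[0].save("cat_lineart.png")
cn_images[1].save("cat_depth.png")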

output_image = pipe(
    prompt="A cute cat sitting on marble stairs, high quality, highly-detailed, hyper-realistic, RAW, DSLR",
    negative_prompt="ugly, deformed, malformed, bad, low quality",
    image=cn_images,
    generator=generator,
    guidance_scale=7.5,
    num_inference_steps=40,
).images[0]

We run this test image of a cat with the prompt "A cute cat sitting on marble stairs, high quality, highly-detailed, hyper-realistic, RAW, DSLR". The ControlNet conditions ensure that the structure and details of the input image are preserved, while the style is changed.

Input image of a cat

Output image of a cat sitting on marble stairs
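
If you want to save the result or balance the two conditions differently, the pipeline also accepts a controlnet_conditioning_scale argument that takes one weight per ControlNet. Here is a minimal sketch; the file name and scale values are arbitrary examples, not tuned settings.

output_image.save("cat_marble_stairs.png")

# Re-run with per-ControlNet weights: the first value scales the line art
# condition, the second scales the depth condition (example values only).
weighted_image = pipe(
    prompt="A cute cat sitting on marble stairs, high quality, highly-detailed, hyper-realistic, RAW, DSLR",
    negative_prompt="ugly, deformed, malformed, bad, low quality",
    image=cn_images,
    generator=generator,
    guidance_scale=7.5,
    num_inference_steps=40,
    controlnet_conditioning_scale=[0.8, 0.5],
).images[0]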
