With increasing computing capabilities, current model architectures appear to follow the trend of simply upscaling all components without validating the necessity for doing so. In this project, we investigate the size and architectural design of ControlNet [Zhang et al., 2023] for controlling the image generation process with Stable Diffusion-based models. We show that a new architecture with as little as 1% of the parameters of the base model achieves state-of-the-art results, considerably better than ControlNet in terms of FID score. Hence we call it ControlNet-XS. We provide the code for controlling StableDiffusion-XL [Podell et al., 2023] (Model B, 48M parameters) and StableDiffusion 2.1 [Rombach et al., 2022] (Model B, 14M parameters), all under the OpenRAIL license. The different models are explained in the Method section below.

We evaluate differently sized control models and confirm that the control model does not even have to be of the same order of magnitude as the base network, which has 2.6B parameters. The control is evident for ControlNet-XS sizes of 400M, 104M and 48M parameters, as shown below for guidance with depth maps (Midas [Ranftl et al., 2020]) and Canny edges, respectively. Each row shows three example results of Model B, each with a different seed. Note that we use the same seed for each column.

We show generations of three versions of ControlNet-XS with 491M, 55M and 14M parameters, respectively. We control Stable Diffusion with depth maps (Midas) and Canny edges. Even the smallest model, with 1.6% of the size of the base model (865M parameters), is able to reliably guide the generation process. As above, each row shows three example results of Model B, each with a different seed. Note that we use the same seed for each column.
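The relative sizes quoted above can be checked with a few lines of arithmetic. The sketch below uses only the parameter counts given in the text; pairing the 2.6B base network with StableDiffusion-XL is inferred from the 48M Model B mentioned in the introduction, and the helper name `relative_size` is hypothetical.

```python
# Parameter counts in millions, as quoted in the text.
# Assumption: the 2.6B base network is StableDiffusion-XL.
BASE_PARAMS_M = {"StableDiffusion-XL": 2600, "StableDiffusion 2.1": 865}
CONTROL_PARAMS_M = {
    "StableDiffusion-XL": [400, 104, 48],
    "StableDiffusion 2.1": [491, 55, 14],
}


def relative_size(control_m: float, base_m: float) -> float:
    """Return the control model size as a percentage of the base model size."""
    return 100.0 * control_m / base_m


for name, base in BASE_PARAMS_M.items():
    for control in CONTROL_PARAMS_M[name]:
        pct = relative_size(control, base)
        print(f"{name}: {control}M of {base}M parameters = {pct:.1f}%")
```

For the smallest StableDiffusion 2.1 variant this reproduces the 1.6% stated above (14M of 865M).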

The original ControlNet is a copy of the U-Net encoder in the StableDiffusion base model, and hence receives the same input as the base model, with an additional guidance signal such as an edge map. The intermediate outputs of the trained ControlNet are then added to the inputs of the decoder layers of the base model. Throughout the training process of ControlNet, the weights of the base model are kept frozen. We identify several conceptual issues with this approach, which lead to an unnecessarily large ControlNet and to a significant reduction in the quality of the generated image:

- The final output image of Stable Diffusion, which we call the base model, is generated iteratively in a series of time steps. At each time step a U-Net, with an encoder and a decoder, is executed as illustrated below. At each iteration, the input to the base model and the control model is the generated image of the previous time step. The control model additionally receives a control image. The problem is that in the encoder phase, both models operate independently, and the feedback from the control model enters only in the decoder phase of the base model. The result is a delayed correction/controlling mechanism, and it implies that the ControlNet has to do two jobs. Instead of focusing all network capacity on correction/controlling, ControlNet additionally has to anticipate what “mistakes” the encoder of the base model is going to make.
- If one assumes that image generation and controlling require similar model capacities, it is natural to initialize the weights of ControlNet with the weights of the base model and then fine-tune them. With ControlNet-XS, we diverge in design from the base model and hence train its weights from scratch.
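The data flow of the original ControlNet, and the delayed feedback described in the first point, can be illustrated with a deliberately tiny sketch. It is not the ControlNet implementation: layers are scalar `tanh` units instead of convolutional blocks, skip inputs are summed rather than concatenated, and all names (`run_encoder`, `run_decoder`, `controlnet_step`) and weight values are hypothetical.

```python
import math


def run_encoder(weights, x):
    """Apply each toy layer in turn and return all intermediate outputs."""
    outputs = []
    h = x
    for w in weights:
        h = math.tanh(w * h)
        outputs.append(h)
    return outputs


def run_decoder(weights, h, skips):
    """Each decoder layer consumes the matching skip input, deepest first."""
    for w, skip in zip(weights, reversed(skips)):
        h = math.tanh(w * (h + skip))
    return h


def controlnet_step(base_enc, base_dec, ctrl_enc, x, guidance):
    # The base encoder runs without any access to the control signal.
    base_feats = run_encoder(base_enc, x)
    # The control model receives the same input plus the guidance signal.
    ctrl_feats = run_encoder(ctrl_enc, x + guidance)
    # Control outputs are added to the inputs of the decoder layers only.
    skips = [b + c for b, c in zip(base_feats, ctrl_feats)]
    return run_decoder(base_dec, base_feats[-1], skips), base_feats


base_enc = [0.8, 1.2]      # frozen base encoder weights (toy values)
base_dec = [1.0, 0.9]      # frozen base decoder weights (toy values)
ctrl_enc = list(base_enc)  # control model starts as a copy of the base encoder

out_plain, feats_plain = controlnet_step(base_enc, base_dec, ctrl_enc, 0.5, 0.0)
out_guided, feats_guided = controlnet_step(base_enc, base_dec, ctrl_enc, 0.5, 1.0)

print("base encoder features changed:", feats_plain != feats_guided)
print("decoder output changed:", out_plain != out_guided)
```

In this toy setting the guidance changes the decoder output but leaves the base encoder features untouched, which is the delay the first point refers to.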

We address the first problem, the delayed feedback, by adding connections from the encoder of the base model into the controlling encoder (A). In this way, the corrections can adapt more quickly to the generation process of the base model. Nonetheless, this does not eliminate the delay entirely, since the encoder of the base model still remains unguided. Hence, we add further connections from ControlNet-XS into the base model encoder, directly influencing the entire generative process (B). For completeness, we evaluate whether there is any benefit in using a mirrored, decoding architecture in the ControlNet setup (C).
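The difference between variants A and B can be sketched for the encoder phase as follows. This is a toy illustration under strong assumptions, not the ControlNet-XS code: layers are scalar `tanh` units, information is exchanged by simple addition, and the function name `encode`, the flag `feedback_to_base` and all weight values are hypothetical.

```python
import math


def encode(base_enc, ctrl_enc, x, guidance, feedback_to_base):
    """Run the base and control encoders layer by layer in lockstep.

    feedback_to_base=False mimics variant A, True mimics variant B.
    """
    h_base, h_ctrl = x, x + guidance
    base_feats = []
    for w_base, w_ctrl in zip(base_enc, ctrl_enc):
        # Variants A and B: the control encoder sees the current base features.
        h_ctrl = math.tanh(w_ctrl * (h_ctrl + h_base))
        # Variant B only: the control output feeds back into the base encoder.
        base_in = h_base + h_ctrl if feedback_to_base else h_base
        h_base = math.tanh(w_base * base_in)
        base_feats.append(h_base)
    return base_feats


base_enc = [0.8, 1.2]  # frozen base weights (toy values)
ctrl_enc = [0.3, 0.5]  # separate control weights, not copied from the base

for variant, feedback in (("A", False), ("B", True)):
    plain = encode(base_enc, ctrl_enc, 0.5, 0.0, feedback)
    guided = encode(base_enc, ctrl_enc, 0.5, 1.0, feedback)
    print(variant, "base encoder reacts to guidance:", plain != guided)
```

Under these assumptions, the base encoder features react to the guidance signal only with the feedback connections of variant B; in variant A the base encoder still runs unguided, as noted above.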

We evaluate the performance of the three variants (A, B, C) for Canny edge guidance in comparison to the original ControlNet in terms of FID score over the validation set of COCO2017 [Lin et al., 2014]. All of our variants achieve a significant improvement while having just a fraction of the parameters of the original ControlNet.
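For reference, the FID (Fréchet Inception Distance) is commonly defined as the Fréchet distance between two Gaussians fitted to Inception network features of real and generated images; lower values are better. The symbols below are the standard ones and are not taken from this text: $\mu_r, \Sigma_r$ are the mean and covariance of the real-image features, and $\mu_g, \Sigma_g$ those of the generated-image features. The exact feature extractor and evaluation settings used for the comparison are not stated here.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```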

We focus our attention on variant B and train it with three different model sizes for Canny edge and depth map guidance, respectively, for both StableDiffusion 2.1 and the current StableDiffusion-XL version.

- [Ranftl et al., 2020]
- René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1623–1637, 2020.
- [Rombach et al., 2022]
- Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- [Podell et al., 2023]
- Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: improving latent diffusion models for high-resolution image synthesis. CoRR, abs/2307.01952, 2023.
- [Zhang et al., 2023]
- Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
- [Lin et al., 2014]
- Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.