How to Use Stable Diffusion With ControlNet for Precise Composition

April 3, 2026 · Editorial Team · 6 min read · stable-diffusion ai-image controlnet

One of the real frustrations with text-to-image generation is that you can describe a composition perfectly and the model still decides to put the character wherever it wants. ControlNet fixes that. It's the extension that gives you structural control over the output: you provide a map of edges, a depth layout, or a skeleton pose, and the model generates an image that follows that structure while applying your prompt on top of it.

Stable Diffusion with ControlNet is a different tool from vanilla Stable Diffusion. It's more work to set up, the settings have more moving parts, and there's a learning curve to understanding which preprocessor to use for which task. But once you know the workflow, you have a level of compositional control that few other consumer image tools match.

How ControlNet Works at a Basic Level

ControlNet works by taking a conditioning image (your input), running it through a preprocessor to extract a structural map, and then using that map as an additional constraint during the diffusion process alongside your text prompt.

The preprocessor is the key step. Raw photos or drawings contain too much information for the model to use as a direct control signal. The preprocessor strips the image down to just the information you want to enforce: edges only, depth only, skeleton only, surface normals only. The resulting map tells the diffusion model "this is the geometry you must respect" while the text prompt handles all the visual style and content decisions.
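To make "strips the image down" concrete, here is a toy preprocessor in plain Python. It is a simplified stand-in, not the Canny algorithm or any code from ControlNet itself: the function name `edge_map`, the threshold value, and the 4x4 sample "photo" are all made up for illustration. It keeps only one piece of information (where brightness jumps sharply) and throws away everything else.

```python
def edge_map(gray, threshold=50):
    """Return a binary edge map (255 = edge, 0 = flat) for a 2D grayscale grid.

    Toy stand-in for a real preprocessor: a pixel counts as an edge when the
    brightness jump to its right or lower neighbour exceeds `threshold`.
    """
    height, width = len(gray), len(gray[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            right = abs(gray[y][x] - gray[y][x + 1]) if x + 1 < width else 0
            down = abs(gray[y][x] - gray[y + 1][x]) if y + 1 < height else 0
            if max(right, down) > threshold:
                edges[y][x] = 255
    return edges


# Hypothetical 4x4 "photo": dark left half, bright right half, slight noise.
photo = [
    [10, 12, 200, 205],
    [11, 10, 198, 202],
    [9, 13, 201, 199],
    [12, 11, 203, 200],
]

for row in edge_map(photo):
    print(row)  # each row prints [0, 255, 0, 0]
```

The small brightness noise disappears and only the vertical boundary survives. That is the kind of reduced signal a real preprocessor hands to the diffusion model.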

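In AUTOMATIC1111, you install ControlNet by adding the sd-webui-controlnet extension and downloading the control model weights separately (typically from Hugging Face). ComfyUI supports ControlNet through its own nodes rather than this extension, but you still download the same weights. Note that the weights belong to the control models, not the preprocessors: the main SD 1.5 control models (canny, depth, openpose) each run around 1.4 GB.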

Canny: Controlling Edges and Line Structure

Canny edge detection is the most broadly useful ControlNet preprocessor. It converts an input image into a map of detected edges, which the model then uses to maintain that outline structure in the output.

Use cases for canny: redrawing an existing illustration in a different style while keeping the composition, converting a rough sketch into a finished image, maintaining the structure of a reference photo while changing the art style.

Settings that work:

Preprocessor: canny
Model: control_v11p_sd15_canny (for SD 1.5) or the equivalent SDXL canny model
Control weight: 0.7 to 0.9 for strict edge following
Starting control step: 0 (apply from the beginning)
Ending control step: 0.85 (release control before final steps to allow natural texture)
Canny low threshold: 100, Canny high threshold: 200 (defaults work for most images)
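If you drive AUTOMATIC1111 through its web API rather than the UI, the settings above map onto one ControlNet unit in the request body. The sketch below only builds and prints that body. The helper name `canny_unit`, the prompt, and the image placeholder are hypothetical, and the field names (`module`, `guidance_end`, `threshold_a`, and so on) are my assumption of what the sd-webui-controlnet API expects; they have changed between extension versions, so check them against your install.

```python
import json


def canny_unit(image_b64, weight=0.8, end=0.85, low=100, high=200):
    """Build one ControlNet unit using the canny settings from this section.

    Hypothetical helper; field names are assumed from the extension's API.
    """
    return {
        "enabled": True,
        "image": image_b64,          # base64-encoded reference image
        "module": "canny",           # preprocessor
        "model": "control_v11p_sd15_canny",
        "weight": weight,            # 0.7 to 0.9 for strict edge following
        "guidance_start": 0.0,       # starting control step
        "guidance_end": end,         # ending control step
        "threshold_a": low,          # canny low threshold
        "threshold_b": high,         # canny high threshold
    }


payload = {
    "prompt": "watercolor illustration of a lighthouse",  # placeholder prompt
    "steps": 20,
    "alwayson_scripts": {
        "controlnet": {"args": [canny_unit("<base64 PNG goes here>")]}
    },
}

print(json.dumps(payload, indent=2))
```

For a stylistic redraw you would call `canny_unit(..., weight=0.7)` and leave the rest unchanged.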

The ending control step at 0.85 is worth understanding. The value is a fraction of the sampling process, not a step number. If you apply canny all the way to 1.0 (the full diffusion process), the output edges feel mechanical and the image lacks natural texture. Releasing the control at 0.85 lets the last 15% of diffusion steps smooth and naturalize the result.
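As a quick piece of arithmetic, this is how a start/end fraction translates into actual sampling steps. The function `controlled_steps` is a hypothetical illustration; the extension's exact rounding at the boundary may differ by a step.

```python
def controlled_steps(total_steps, start=0.0, end=0.85):
    """Return the 0-indexed sampling steps during which ControlNet is applied.

    A step counts as controlled when its position in the schedule
    (step / total_steps) falls inside [start, end).
    """
    return [i for i in range(total_steps) if start <= i / total_steps < end]


steps = controlled_steps(20, start=0.0, end=0.85)
print(len(steps), "controlled,", 20 - len(steps), "free")  # 17 controlled, 3 free
```

With 20 sampling steps, only the final three run without the edge constraint. At low step counts, a small change in the ending value can therefore add or remove a whole step of free denoising.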

For stylistic redrawing (e.g., converting a photo to anime or painterly style), a weight of 0.65 to 0.75 allows more style freedom while still respecting the major compositional lines.

Depth: Controlling Spatial Layout

The depth preprocessor creates a grayscale map where lighter values are closer to the camera and darker values are farther away. The model uses this depth map to maintain the spatial relationships in the composition: an object that was in the foreground stays in the foreground, background elements stay at depth.

This is particularly useful for:

Generating characters or objects in a specific position relative to a background
Maintaining environment depth across multiple generations in a series
Converting a photo environment layout into a stylized illustration with the same spatial structure

Settings:

Preprocessor: depth_midas (general purpose) or depth_zoe (better for indoor scenes)
Model: control_v11f1p_sd15_depth
Control weight: 0.6 to 0.8
Starting/ending steps: 0 to 1.0 (depth can run the full generation without causing the same texture artifacts as canny)

At weight 0.6, the model respects the depth layout while having freedom in how it fills each depth layer. At 0.8, spatial placement is stricter. I generally keep depth weight a little lower than canny weight, but I don't go below about 0.6: when the model ignores the depth map, the resulting artifact (a background object appearing to float in the foreground) is more disruptive than a line that shifts slightly.
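The "lighter is closer" convention described in this section is easy to get backwards, so here is a minimal sketch of it. The function `depth_to_gray` and the tiny scene are hypothetical, and real depth estimators such as MiDaS predict relative depth from a photo rather than receiving distances like these. The point is only the mapping from near/far to white/black.

```python
def depth_to_gray(distances):
    """Map per-pixel distances to 0-255 grayscale: nearest = 255, farthest = 0."""
    flat = [d for row in distances for d in row]
    near, far = min(flat), max(flat)
    span = (far - near) or 1.0  # avoid dividing by zero on a flat scene
    return [[round(255 * (far - d) / span) for d in row] for row in distances]


# Hypothetical scene: a far wall on top, a near object in the bottom middle.
scene = [
    [9.0, 9.0, 9.0],
    [3.0, 1.0, 3.0],
]

for row in depth_to_gray(scene):
    print(row)
# [0, 0, 0]
# [191, 255, 191]
```

If a depth map you load from elsewhere looks inverted (white background, dark subject), it is using the opposite convention and needs flipping before it goes into the depth control model.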

OpenPose: Controlling Human Poses

OpenPose is the preprocessor that extracts a skeleton from a reference image: head position, shoulder joints, elbow joints, wrists, hips, knees, ankles. The model generates a human figure that follows that skeleton while applying your text prompt for the appearance.

This is transformative for character work. Instead of writing "standing with arms crossed, weight shifted left, looking down" and hoping the model interprets that correctly, you provide a reference photo of any person in that pose and let OpenPose extract the skeleton.

Settings:

Preprocessor: openpose (body) or openpose_full (includes hands and face)
Model: control_v11p_sd15_openpose
Control weight: 0.8 to 1.0 for strict pose following
Starting step: 0
Ending step: 1.0

For openpose_full, the hand skeleton detection improves hand rendering significantly, which is one of Stable Diffusion's historically weak areas. The tradeoff is that the preprocessor is slower and occasionally misidentifies hand positions in unusual poses.

The control weight for openpose should be higher than for canny or depth because pose interpretation errors are more visually obvious. At 0.75, the model sometimes places limbs in slightly different positions. At 0.9, the skeleton adherence is strict.
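It helps to see how little information a pose skeleton actually carries. The sketch below is a deliberately simplified, hypothetical representation (upper body only, invented joint names and coordinates); the real OpenPose output has more keypoints and its own format. Joints are stored as normalized coordinates and bones as pairs of joint names, and nothing about clothing, face, or style is present, which is why the prompt is free to decide all of that.

```python
# Hypothetical upper-body pose; x and y are fractions of image width/height.
pose = {
    "head": (0.5, 0.125),
    "left_shoulder": (0.375, 0.25),
    "right_shoulder": (0.625, 0.25),
    "left_elbow": (0.25, 0.5),
    "right_elbow": (0.75, 0.5),
}

# Bones connect two named joints (simplified for illustration).
bones = [
    ("head", "left_shoulder"),
    ("head", "right_shoulder"),
    ("left_shoulder", "left_elbow"),
    ("right_shoulder", "right_elbow"),
]


def to_segments(pose, bones, width, height):
    """Convert normalized joints into pixel-space line segments, one per bone."""
    def px(name):
        x, y = pose[name]
        return (round(x * width), round(y * height))

    return [(px(a), px(b)) for a, b in bones]


for segment in to_segments(pose, bones, 512, 512):
    print(segment)
```

Because the coordinates are relative, the same skeleton can be drawn at whatever resolution you generate at.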

Combining ControlNet with img2img

ControlNet and img2img are often used together. The typical case: you have a reference image with the right composition but the wrong style, and you want the new output to be stylistically different but structurally similar. The workflow:

Load your reference image in the img2img tab
Enable ControlNet and add your preprocessor (canny or depth, not openpose unless you need a specific pose)
Set img2img denoising strength to 0.6 to 0.75 (lower keeps more of the original, higher allows more change)
Set ControlNet weight to 0.5 to 0.7 (lower than you'd use for pure txt2img, because img2img already constrains the output via pixel similarity)
Write a style-focused prompt

The dual constraint of img2img pixel similarity plus ControlNet structure produces very controlled transformations. You can shift art style dramatically while keeping the spatial composition nearly identical.
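The steps above can be sketched as an API request body for AUTOMATIC1111's img2img endpoint. As before, this only builds the payload and sends nothing. The helper `img2img_payload` and the prompt are hypothetical, and the ControlNet field names are my assumption of the sd-webui-controlnet API, which varies between versions.

```python
def img2img_payload(init_b64, prompt, denoise=0.65, control_weight=0.6):
    """Build an img2img request that also applies a canny ControlNet unit.

    Hypothetical helper; defaults sit inside the ranges suggested above.
    """
    if not 0.0 <= denoise <= 1.0:
        raise ValueError("denoising strength must be between 0 and 1")
    return {
        "init_images": [init_b64],       # step 1: the reference image
        "prompt": prompt,                # step 5: style-focused prompt
        "denoising_strength": denoise,   # step 3: 0.6 to 0.75
        "alwayson_scripts": {
            "controlnet": {
                "args": [
                    {
                        "enabled": True,             # step 2
                        "image": init_b64,           # same reference image
                        "module": "canny",
                        "model": "control_v11p_sd15_canny",
                        "weight": control_weight,    # step 4: 0.5 to 0.7
                    }
                ]
            }
        },
    }


body = img2img_payload("<base64 PNG goes here>", "oil painting, thick brushstrokes")
print(body["denoising_strength"], body["alwayson_scripts"]["controlnet"]["args"][0]["weight"])
```

The two numbers to tune together are `denoise` and `control_weight`: raising the first loosens the pixel constraint, while the second decides how firmly the structure holds as that happens.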

Stacking Multiple ControlNet Units

The ControlNet extension for AUTOMATIC1111 allows multiple ControlNet units to be active simultaneously. A common stack for character illustration:

Unit	Preprocessor	Weight
Unit 0	openpose	0.85
Unit 1	depth_midas	0.5

This gives pose control via Unit 0 while depth provides the spatial relationship between the character and environment. The depth weight is lower because openpose already handles the main structural constraint; depth just prevents background elements from bleeding into the wrong planes.

Don't stack more than two or three units. Each additional unit adds processing time and the interactions between units can produce artifacts that are hard to diagnose.
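Expressed as data, the two-unit stack from the table looks like the sketch below. The helpers `unit` and `build_stack` are hypothetical, the three-unit cap simply encodes the advice above rather than any limit enforced by the extension, and the field names are again assumed from the sd-webui-controlnet API.

```python
def unit(module, model, weight, image_b64):
    """Describe one ControlNet unit (field names assumed, see lead-in)."""
    return {
        "enabled": True,
        "image": image_b64,
        "module": module,
        "model": model,
        "weight": weight,
    }


def build_stack(units, max_units=3):
    """Wrap units for the API, refusing stacks larger than the advised limit."""
    if len(units) > max_units:
        raise ValueError(f"{len(units)} units requested; keep it to {max_units} or fewer")
    return {"controlnet": {"args": list(units)}}


reference = "<base64 PNG goes here>"  # placeholder reference image
stack = build_stack([
    unit("openpose", "control_v11p_sd15_openpose", 0.85, reference),   # Unit 0
    unit("depth_midas", "control_v11f1p_sd15_depth", 0.5, reference),  # Unit 1
])

for i, u in enumerate(stack["controlnet"]["args"]):
    print(f"Unit {i}: {u['module']} at weight {u['weight']}")
```

The order of the list is the unit order, so Unit 0 is the pose constraint and Unit 1 the lower-weight depth constraint.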

ControlNet adds significant setup overhead to a Stable Diffusion workflow. But for tasks that require consistent composition across multiple images, precise pose control, or reliable structure transfer between styles, there's no substitute. The canny, depth, and openpose preprocessors cover the vast majority of real use cases, and once you have their weight ranges calibrated for your typical prompts, the workflow becomes fast.