Stable Diffusion CUDA Out of Memory Fix at High Resolution

April 29, 2026 · Editorial Team · 6 min read · stable-diffusion troubleshooting error-fix

You're running SDXL 1.0 in Automatic1111 1.10, you bump the output resolution to 1536x1024 for a client project, hit Generate, and 30 seconds into sampling you get: RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU 0 has a total capacity of 8.00 GiB; of which 512.00 MiB is free. The job dies, the WebUI reloads, and you're back to square one. This error is one of the most common blockers for SD users on consumer GPUs in the 8 to 12 GB range, such as the RTX 3080 or 4070, and the frustrating part is that the same settings worked fine at 1024x768 an hour ago.

What this error actually means

GPU memory (VRAM) is a finite resource that Stable Diffusion allocates in chunks for the diffusion model weights, the attention maps, the intermediate latent tensors, and the VAE decoder. The weights are a fixed cost that doesn't depend on output resolution: SDXL 1.0's base U-Net has about 2.6 billion parameters, which works out to roughly 5 GB in half precision and about twice that in float32. Everything else grows with resolution. At 1536x1024, the attention maps alone can add more than 2 GB, because attention memory scales quadratically with the number of spatial tokens, and the token count itself grows with pixel area.

When PyTorch tries to allocate a tensor that doesn't fit in free VRAM and can't reclaim enough from its own cache, it throws RuntimeError: CUDA out of memory. The error shows you what it tried to allocate and what was free, which is useful for diagnosing severity. A shortfall of 200 MB is fixable with flags. A shortfall of 4 GB means you're generating at a resolution your GPU physically can't handle without offloading to RAM.
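To make that severity check concrete, here is a minimal Python sketch that pulls the requested and free amounts out of an error message like the one quoted above and reports the gap. The function name is our own, and the regular expressions assume the message wording shown in this article; other PyTorch versions phrase it differently, so treat the patterns as an assumption.

```python
import re

# Multipliers for the binary units that appear in the message.
_UNITS = {"KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def oom_shortfall_mib(message: str) -> float:
    """Return (requested - free) in MiB for a CUDA OOM message.

    Assumes the wording 'Tried to allocate X UNIT' and 'Y UNIT is free'.
    """
    requested = re.search(r"Tried to allocate ([\d.]+) (KiB|MiB|GiB)", message)
    free = re.search(r"([\d.]+) (KiB|MiB|GiB) is free", message)
    if not (requested and free):
        raise ValueError("message does not look like a CUDA OOM error")

    def to_bytes(match: re.Match) -> float:
        return float(match.group(1)) * _UNITS[match.group(2)]

    return (to_bytes(requested) - to_bytes(free)) / _UNITS["MiB"]


msg = (
    "RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB. "
    "GPU 0 has a total capacity of 8.00 GiB; of which 512.00 MiB is free."
)
print(f"Shortfall: {oom_shortfall_mib(msg):.0f} MiB")
```

For the example message the gap is 1536 MiB, which sits between the "fixable with flags" and "physically can't handle" cases. Keep in mind that the failed allocation is only the last straw, so the true deficit for the whole generation can be larger.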

Quick fix (when you need it working in 60 seconds)

In Automatic1111, add --medvram (for 4-8 GB GPUs) or --lowvram (for under 4 GB) to COMMANDLINE_ARGS in webui-user.bat or webui-user.sh. These are launch flags rather than options on the Settings page.
Enable "xformers" memory efficient attention if you haven't already (Settings > Optimizations > Cross attention optimization: xformers).
Reduce your resolution to the next step down (from 1536x1024 to 1024x768, or from 1280x720 to 1024x720).
Restart the WebUI completely (not just re-click Generate) to flush VRAM allocation.
Re-run your generation.

Why this happens

Several factors compound to push you over your VRAM limit, often without warning.

Quadratic attention scaling. Self-attention in the U-Net backbone of SDXL computes attention maps whose memory footprint grows with the square of the number of spatial tokens. At the deepest attention level, a 512x512 image yields 256 tokens, 1024x1024 yields 1024, and 1536x1024 yields 1536 (shallower levels carry more). This is why doubling resolution doesn't double memory use: doubling each side quadruples the token count, and a naively materialized attention map then grows about 16x.
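The arithmetic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a measurement: it assumes the attention level sits at 1/32 of the pixel resolution per side (8x from the VAE, 4x more inside the U-Net), which reproduces the token counts quoted above, and it assumes half-precision values and a fully materialized attention matrix. The function names are our own.

```python
def attention_tokens(width: int, height: int, downscale: int = 32) -> int:
    """Spatial tokens at an attention level `downscale`x smaller per side.

    downscale=32 is an assumption: 8x VAE reduction times 4x U-Net reduction.
    """
    return (width // downscale) * (height // downscale)


def naive_attention_mib(tokens: int, bytes_per_value: int = 2) -> float:
    """Size in MiB of one tokens x tokens attention matrix (one head, one image)."""
    return tokens * tokens * bytes_per_value / 1024**2


for w, h in [(512, 512), (1024, 1024), (1536, 1024)]:
    n = attention_tokens(w, h)
    print(f"{w}x{h}: {n} tokens, {naive_attention_mib(n):.3f} MiB per head")
```

The per-head figures look small, but they are multiplied by the number of heads, the number of attention layers, and the batch size (doubled again when classifier-free guidance runs both prompts in one batch), and shallower levels with more tokens cost far more. Memory-efficient attention kernels avoid building this matrix in full, which is why they help so much at high resolution.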

Other resident VRAM consumers. Your GPU is also running your display, possibly a browser with WebGL, Discord's GPU layer, and any other background processes. A browser with five tabs can consume 500 MB to 1.5 GB of VRAM invisibly. When you ran SD earlier with no issues, you may have had fewer background apps open.

VAE and high-res fix memory stacking. Automatic1111's high-res fix mode runs the full diffusion pass at low resolution, then upscales and runs a second denoising pass at high resolution. This double-pass means two sets of tensors are alive in VRAM simultaneously during the handoff. That's when OOM errors hit most often, even for users who are fine during standard generation.

Model merges and LoRAs. Running multiple LoRAs simultaneously keeps additional weight deltas in VRAM. Three LoRAs with 50 MB each add 150 MB that compounds with everything else.

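PyTorch caching. PyTorch's caching allocator holds on to freed memory blocks to speed up future allocations instead of returning them to the driver. That cached memory still counts as used in tools like nvidia-smi, and when it's fragmented into small blocks, a large allocation can fail even though the total unused amount looks sufficient.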
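If you can run Python inside the same process as the WebUI (for example from a custom script or extension), PyTorch exposes the two numbers that describe this cache. The sketch below is a minimal illustration: torch.cuda.memory_allocated, torch.cuda.memory_reserved, and torch.cuda.empty_cache are real PyTorch calls, while the helper names are our own. PyTorch is imported lazily so the arithmetic helper works without it.

```python
def cached_but_unused(allocated_bytes: int, reserved_bytes: int) -> int:
    """Bytes held by PyTorch's caching allocator that no live tensor is using."""
    return max(reserved_bytes - allocated_bytes, 0)


def report_and_release() -> None:
    """Print allocator stats for the current device, then release the cache."""
    import torch  # imported here so the helper above has no dependencies

    if not torch.cuda.is_available():
        print("No CUDA device visible to PyTorch")
        return
    allocated = torch.cuda.memory_allocated()  # bytes in live tensors
    reserved = torch.cuda.memory_reserved()    # bytes held by the allocator
    unused = cached_but_unused(allocated, reserved)
    print(f"live tensors:   {allocated / 1024**2:.0f} MiB")
    print(f"cached, unused: {unused / 1024**2:.0f} MiB")
    torch.cuda.empty_cache()  # returns unused cached blocks to the driver


# Example with made-up numbers: 3 GiB live, 5 GiB reserved.
print(cached_but_unused(3 * 1024**3, 5 * 1024**3) // 1024**2, "MiB cached but unused")
```

Note that empty_cache only releases blocks that no tensor is using. It cannot shrink the memory that live tensors need, so it helps with stale cache between runs but not with a generation that genuinely exceeds your VRAM.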

Permanent fix

These steps address the root causes rather than just lowering resolution.

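Launch with the --medvram-sdxl flag. For Automatic1111 1.10, edit your webui-user.bat (Windows) or webui-user.sh (Linux/Mac) and add the following to the COMMANDLINE_ARGS line. The second flag, --opt-sdp-attention, selects PyTorch's built-in scaled dot product attention; it is an alternative to xformers, so you can drop it if you use xformers instead: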
```
--medvram-sdxl --opt-sdp-attention
```
Install and enable xformers. In your venv, run:
```
pip install xformers==0.0.26.post1
```
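Then in Automatic1111 Settings > Optimizations, set Cross attention optimization to "xformers." The pinned version has to match the PyTorch build in your venv, otherwise pip may replace PyTorch with a different one; launching with --xformers lets the WebUI install a compatible build for you. xformers implements memory-efficient attention that avoids materializing the full attention matrix, so attention memory grows roughly linearly with token count instead of quadratically.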
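Enable VAE tiling. In A1111 Settings > VAE, enable "Enable tiled VAE." If your build has no such option, the Tiled Diffusion & VAE extension (multidiffusion-upscaler-for-automatic1111) adds a Tiled VAE panel that does the same job. Tiling breaks the VAE decode step into pieces that stay within VRAM limits, normally without visible seams at the tile boundaries.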
Close VRAM-consuming background apps. Before generating at high resolution: close your browser (or at least GPU-accelerated tabs), close Discord if it's using hardware acceleration, and use Task Manager (Windows) or nvidia-smi (Windows and Linux) to verify free VRAM before starting.
```
nvidia-smi --query-gpu=memory.free,memory.total --format=csv
```
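In ComfyUI 0.3, use a FreeMemory node. As far as we can tell this node comes from a custom node pack rather than core ComfyUI, so install one if it's missing from your node list. Add it at the end of your workflow to explicitly flush the VRAM cache between runs. This prevents PyTorch cache buildup across multiple generations in the same session.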
Generate at native resolution, then upscale. For SDXL 1.0, the native trained resolution is 1024x1024. Generate there, then use ESRGAN 4x or RealESRGAN upscaling to reach your final target. You get better detail than forcing high-res generation on a memory-limited GPU, and VRAM use stays predictable.
Use --precision full --no-half-vae only if you're debugging. These flags increase VRAM use significantly. Remove them for production work unless you're chasing a specific artifact.
For ComfyUI: enable model offloading. In your ComfyUI startup command:
```
python main.py --lowvram
```
This enables CPU offloading of model layers not currently in use, at the cost of some generation speed.
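To see why the VAE tiling step in the list above keeps memory bounded, it helps to look at how an image grid can be cut into overlapping tiles of fixed size. The following Python sketch only computes tile coordinates. It is not how any particular extension implements tiled VAE (real implementations also blend the overlaps and handle normalization statistics), and the tile size, overlap, and function name are illustrative assumptions.

```python
def tile_boxes(width: int, height: int, tile: int = 64, overlap: int = 8):
    """Yield (left, top, right, bottom) boxes that cover a width x height grid.

    Every box is at most tile x tile, so the memory needed to process one
    box does not grow with the full image size.
    """
    if overlap >= tile:
        raise ValueError("overlap must be smaller than tile")
    step = tile - overlap

    def starts(size: int) -> list:
        # Tile origins along one axis; the last tile is shifted back to fit.
        if size <= tile:
            return [0]
        points = list(range(0, size - tile, step))
        points.append(size - tile)
        return points

    for top in starts(height):
        for left in starts(width):
            yield (left, top, min(left + tile, width), min(top + tile, height))


# A 1536x1024 image has a 192x128 latent, assuming 8x VAE downsampling.
boxes = list(tile_boxes(192, 128))
largest = max((r - l) * (b - t) for l, t, r, b in boxes)
print(f"{len(boxes)} tiles, largest covers {largest} latent positions")
```

The point of the design is that peak memory depends on the tile size rather than the output size: decoding twelve 64x64 latent tiles one after another takes longer than decoding the 192x128 latent in one pass, but each step needs only a fraction of the memory.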

Prevention

Think of VRAM as a budget you need to track before starting a session. Before running any high-resolution generation, open a terminal and run nvidia-smi. Verify your free VRAM. If you have an 8 GB GPU and only 3 GB free before launching Automatic1111, you're starting in a deficit for anything above 1024x768 with SDXL.
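As a rough sketch of that pre-flight check, the Python below wraps the nvidia-smi query and compares free VRAM against a threshold you choose. The nvidia-smi options used (--query-gpu and --format=csv,noheader,nounits) are standard, but the function names and the needed_mib threshold are our own illustrative choices, and the threshold for your workload is something you have to find by testing.

```python
import subprocess


def parse_memory_csv(text: str) -> list:
    """Parse 'free, total' MiB pairs, one line per GPU, from nvidia-smi output."""
    rows = []
    for line in text.strip().splitlines():
        free, total = (int(part.strip()) for part in line.split(","))
        rows.append((free, total))
    return rows


def query_gpus() -> list:
    """Run nvidia-smi; requires the NVIDIA driver tools on PATH."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.free,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_csv(result.stdout)


def enough_headroom(free_mib: int, needed_mib: int) -> bool:
    """True when the GPU has at least needed_mib MiB free."""
    return free_mib >= needed_mib


# Example with sample output instead of a live GPU: 3 GiB free of 8 GiB.
free, total = parse_memory_csv("3072, 8192\n")[0]
print("OK to start" if enough_headroom(free, 6000) else "Free up VRAM first")
```

On a machine with an NVIDIA GPU you would call query_gpus() instead of passing sample text, then run the headroom check before launching the WebUI.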

Build a generation profile for each resolution target. Know that your RTX 3080 (10 GB) with xformers and medvram can handle SDXL at 1280x1024 comfortably, but 1536x1024 needs VAE tiling enabled to avoid OOM. Document these settings in a simple text file so you're not rediscovering them every session.
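One lightweight way to keep such a profile is a small lookup table that you save as JSON next to your launch script. The sketch below is hypothetical throughout: the entries simply mirror the RTX 3080 example above and are not benchmark results, and the key names are our own.

```python
import json

# Hypothetical profile for one GPU; replace the entries with settings
# you have actually tested on your own hardware.
PROFILES = {
    "1280x1024": {"flags": ["--medvram", "--xformers"], "tiled_vae": False},
    "1536x1024": {"flags": ["--medvram", "--xformers"], "tiled_vae": True},
}


def settings_for(resolution: str, profiles: dict = PROFILES) -> dict:
    """Return the tested settings for a resolution, or fail loudly."""
    if resolution not in profiles:
        raise KeyError(f"no tested profile for {resolution}")
    return profiles[resolution]


# Serialize the whole table so it can be saved to a text file.
print(json.dumps(PROFILES, indent=2))
print(settings_for("1536x1024"))
```

Failing loudly on an untested resolution is deliberate: it reminds you to test and record a new target once instead of discovering the OOM mid-job.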

For ComfyUI users running complex workflows with multiple models (a base SDXL model, a refiner, a ControlNet, and a LoRA), consider serializing your pipeline so that only the models needed for the current stage are resident in VRAM. Use a ModelUnload node (or whichever model-unloading node your installed node packs provide) before loading the next model in sequence. This takes more time but avoids the 12 GB+ VRAM requirement that stacked models create.
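The saving from serializing is easy to estimate: stacked models cost the sum of their sizes, while a staged pipeline costs only its largest stage. The Python sketch below uses made-up, round model sizes purely for illustration (they are not measurements), and it counts weights only, ignoring activations.

```python
# Illustrative weight sizes in GiB; assumptions, not measurements.
MODEL_GIB = {"base": 5.0, "refiner": 4.5, "controlnet": 2.5, "lora": 0.2}

# Models that must be resident together in each stage of the workflow.
STAGES = [["base", "controlnet", "lora"], ["refiner"]]


def peak_stacked(sizes: dict) -> float:
    """Peak weight memory when every model stays loaded."""
    return sum(sizes.values())


def peak_staged(sizes: dict, stages: list) -> float:
    """Peak weight memory when models are unloaded between stages."""
    return max(sum(sizes[name] for name in stage) for stage in stages)


print(f"stacked: {peak_stacked(MODEL_GIB):.1f} GiB")
print(f"staged:  {peak_staged(MODEL_GIB, STAGES):.1f} GiB")
```

With these example numbers the staged pipeline peaks well below the stacked one, at the cost of reloading the refiner each run.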

If you're running A1111 on a shared workstation or a cloud GPU instance, schedule high-resolution jobs for off-hours when other users aren't consuming GPU memory on the same machine.

When the fix doesn't work

If you've tried medvram, xformers, VAE tiling, and reduced resolution and you're still hitting OOM, your GPU simply doesn't have enough VRAM for your target output at that quality level. A 6 GB GPU running SDXL 1.0 at 1536x1024 without any tricks is not going to work; the math doesn't support it.

Your options at that point are: use a cloud GPU (RunPod with an A100 40 GB runs $0.39/hour as of early 2026 and handles any SDXL resolution without flags), or switch to Stable Diffusion 1.5 / 2.1 which have significantly lower VRAM requirements at high resolution.