PyTorch 2.12.0 + CUDA 13.2 passes SA2/SA3 attention stability test across FLUX, Klein, WAN
A community benchmark confirms that the SA2 and SA3 attention backends run stably on PyTorch 2.12.0 with CUDA 13.2: FLUX Krea, Klein 9B, and WAN 2.2 were tested across five backend configurations with no regressions.
"The PyTorch 2.12.0 + CUDA 13.2 stack is production-ready for SA2/SA3 workflows," according to a Stable Diffusion community member who ran a full attention-backend stress test this week. The benchmark suite confirms that SA2 and SA3 attention modes—alternative memory-efficient attention implementations—remain stable and performant under the latest PyTorch release, with all five tested backends working correctly across three open-weight models.
The test covered three models:

- flux1-krea-dev_fp8_scaled: 20 steps, CFG 1, 1024×1024
- flux-2-klein-base-9b-fp8: 20 steps, CFG 5, 1280×1280
- wan2.2_t2v_high/low_noise_14B_fp16 with the lightx2v 4-step LoRA: 2+2 steps, CFG 1, 640×640

Each model ran under five attention configurations: fp8_cuda, fp8pp_cuda, triton, SA3 standard, and SA3 per_block_mean. The Krea model showed the widest variation in output quality across modes, though the differences remained subtle. Klein 9B produced near-identical results under SA2 and SA3, with no speed penalty. WAN 2.2 video generation matched across most backends, but the SA3 standard and per_block_mean modes introduced minor quality shifts, and the triton+standard combination was unexpectedly slow.
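A minimal sketch of how such a matrix can be driven, assuming a run_generation helper that wraps the actual workflow invocation (the helper, the backend-name strings, and the timing scaffold are illustrative, not the benchmark's own harness):

```python
import itertools
import time

import torch

# Test matrix from the benchmark: three models x five attention backends.
MODELS = [
    "flux1-krea-dev_fp8_scaled",   # 20 steps, CFG 1, 1024x1024
    "flux-2-klein-base-9b-fp8",    # 20 steps, CFG 5, 1280x1280
    "wan2.2_t2v_14B_fp16",         # 2+2 steps, CFG 1, 640x640
]
BACKENDS = ["fp8_cuda", "fp8pp_cuda", "triton", "sa3_standard", "sa3_per_block_mean"]


def run_generation(model: str, backend: str) -> None:
    """Stand-in for the real workflow call; replace with your own pipeline."""
    time.sleep(0.01)  # placeholder work


for model, backend in itertools.product(MODELS, BACKENDS):
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # flush pending kernels before timing
    start = time.perf_counter()
    run_generation(model, backend)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the GPU to finish the run
    print(f"{model} / {backend}: {time.perf_counter() - start:.2f}s")
```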
The test harness is a custom ComfyUI node, ComfyUI-rogala, which dispatches attention calls to different backends on the fly. Both the node and a Windows build of the SA2/SA3 library are available on GitHub. For practitioners running FLUX or WAN models locally, the PyTorch 2.12.0 + CUDA 13.2 stack shows no regressions in stability or speed for the tested configurations.
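The dispatch pattern the node implements can be sketched as a single routing function. This is a minimal illustration, not the ComfyUI-rogala code: the sageattn import and call are assumptions based on the public sageattention package, and unrecognized backend names fall back to PyTorch's built-in scaled_dot_product_attention.

```python
import torch
import torch.nn.functional as F


def dispatch_attention(q, k, v, backend: str = "sdpa"):
    """Route an attention call to the selected backend.

    q, k, v: tensors shaped (batch, heads, seq_len, head_dim).
    """
    if backend.startswith("sa3"):
        # Assumption: the SA2/SA3 library is importable as "sageattention"
        # and exposes sageattn(q, k, v, is_causal=...); the actual node
        # may wire this differently.
        from sageattention import sageattn
        return sageattn(q, k, v, is_causal=False)
    # Fallback: PyTorch's built-in fused attention.
    return F.scaled_dot_product_attention(q, k, v)


# Usage: swap backends without touching the model code.
device = "cuda" if torch.cuda.is_available() else "cpu"
q = torch.randn(1, 8, 256, 64, device=device)
k = torch.randn_like(q)
v = torch.randn_like(q)
out = dispatch_attention(q, k, v, backend="sdpa")
print(out.shape)  # torch.Size([1, 8, 256, 64])
```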
