Loading…

Sparse autoencoders reveal what activates individual features in Stable Diffusion | UncensoredHub

ResearchNSFW

Sparse autoencoders reveal what activates individual features in Stable Diffusion

New latent visualization method uses sparse autoencoders to isolate single-concept features in diffusion models, producing clearer images of what activates each feature than baseline approaches.

May 12, 2026

Sparse autoencoders reveal what activates individual features in Stable Diffusion

Researchers have developed a mechanistic interpretability technique called latent visualization by optimization (LVO) that adapts feature visualization methods from convolutional neural networks to latent diffusion models. The method uses sparse autoencoders to separate polysemantic layer representations—where one neuron responds to multiple concepts—into monosemantic features that each represent a single recognizable idea.

The authors demonstrate LVO on Stable Diffusion 1.5 fine-tuned on the Style50 dataset. The technique produces clear visualizations of concepts including diagonal compositions, human figures, roses, cables, and waterfall foam. These visualizations correlate with actual dataset examples but reveal what activates a feature rather than showing the feature's downstream effects on generated images.

What stands out

01Latent-space optimization — The method operates directly in the VAE latent space rather than pixel space, matching how diffusion models actually process images during generation.
02Time-step activity analysis — LVO tracks when features activate across the diffusion process, showing which concepts emerge early versus late in denoising.
03Schedule-matched noise injection — The technique injects noise following the same schedule the model expects during training, improving optimization stability.
04Prior initialization through feature steering — Starting optimization from steered outputs rather than random noise produces faster convergence to interpretable results.
05Regularization transfer — Standard pixel-space regularization techniques work in latent space but require different hyperparameter configurations for raw-layer versus SAE variants. The SAE approach produces more coherent visualizations than the baseline without disentanglement.

What stands out

More in Research