Gemma 4 26B uncensored 4-bit quantization targets local inference
LeaderboardModel1 released an uncensored, quantized variant of Gemma 4 26B using AutoRound W4A16 compression, enabling unrestricted inference on consumer GPUs.
LeaderboardModel1 released an uncensored, quantized checkpoint of Gemma 4 26B on HuggingFace this week, packaged as a 4-bit AutoRound W4A16 variant produced with round-to-nearest (RTN) rounding. The model targets the Low-Bit Open LLM Leaderboard and is designed for memory-efficient deployment without safety tuning.
The "heretic" label in the model name denotes removal of alignment guardrails, a common pattern in community fine-tunes that strip refusal behavior from base instruction models. AutoRound W4A16 compression stores weights at 4-bit precision while keeping activations at 16-bit. That tradeoff preserves more inference quality than schemes that also quantize activations to 4 bits, at the cost of slightly higher VRAM usage during forward passes.
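To make the W4A16 idea concrete, here is a minimal pure-Python sketch of round-to-nearest 4-bit weight quantization. It is an illustration only, not AutoRound's implementation: the function names, the asymmetric min/max scheme, and the tiny group size are all assumptions chosen for readability, and the checkpoint's actual settings are not stated in the source.

```python
def quantize_rtn_4bit(weights, group_size=4):
    """Asymmetric round-to-nearest quantization to 4-bit codes.

    Each group of `group_size` weights gets its own scale and offset
    (hypothetical layout; real formats also pack two codes per byte).
    """
    qmax = 15  # 4 bits give integer codes 0..15
    codes, scales, zeros = [], [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for a constant group
        scales.append(scale)
        zeros.append(lo)
        # Round each weight to the nearest code, then clamp to the 4-bit range.
        codes.extend(min(qmax, max(0, round((w - lo) / scale))) for w in group)
    return codes, scales, zeros


def dequantize(codes, scales, zeros, group_size=4):
    """Rebuild approximate weights; in W4A16 this result is used in 16-bit math."""
    return [c * scales[i // group_size] + zeros[i // group_size]
            for i, c in enumerate(codes)]


weights = [0.12, -0.40, 0.33, 0.05, -0.91, 0.77, 0.02, -0.15]
codes, scales, zeros = quantize_rtn_4bit(weights)
restored = dequantize(codes, scales, zeros)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print("codes:", codes)
print("worst-case reconstruction error:", worst)
```

Plain round-to-nearest bounds each weight's reconstruction error by half a quantization step within its group. Activations are never quantized in this scheme, which is what the "A16" in the name refers to.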
Local deployment requirements
A 26-billion-parameter model at 4-bit precision needs roughly 13 GB of VRAM for weights alone (26 billion parameters at half a byte each), making it runnable on a single 24 GB RTX 4090 or on two smaller consumer cards in a split configuration. The unquantized parent checkpoint at 16-bit precision would need closer to 52 GB, putting it out of reach for most local setups. The model ships in SafeTensors format and supports the HuggingFace text-generation pipeline, which should allow use with standard inference libraries such as vLLM or Ollama, provided the runtime supports AutoRound-format checkpoints.
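The weight-memory figures above follow from simple arithmetic, sketched below. The helper name and the optional per-group scale overhead (`group_size`, `scale_bits`) are illustrative assumptions, not values taken from the model card. The estimate also excludes the KV cache, activations, and runtime overhead, so real VRAM usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_weight, group_size=None, scale_bits=16):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes).

    If `group_size` is given, add the cost of one `scale_bits`-wide scale
    per group of weights (an assumed layout for illustration).
    """
    bits = bits_per_weight
    if group_size is not None:
        bits += scale_bits / group_size
    return num_params * bits / 8 / 1e9


params = 26e9  # 26 billion parameters, as stated in the article

print(weight_memory_gb(params, 4))    # prints 13.0
print(weight_memory_gb(params, 16))   # prints 52.0
# With an assumed group size of 128, the scales add a few hundred megabytes.
print(weight_memory_gb(params, 4, group_size=128))
```

The 4x gap between the two headline figures is simply the ratio of 16 bits to 4 bits per weight; per-group metadata narrows that gap slightly in practice.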
As of publication, the checkpoint shows zero downloads and zero likes, indicating a very recent upload. No benchmark scores, sample outputs, or ablation studies appear on the model card yet.




