Holo-3.1-4B uncensored GGUF quantizations land on HuggingFace

Two GGUF quant repos for the Holo-3.1-4B-uncensored-heretic model appeared on HuggingFace this week, offering Apache 2.0 weights for local inference.

By Alex Sokoloff · June 11, 2026

An uncensored 4-billion-parameter model has arrived on HuggingFace in two GGUF quantization formats, giving practitioners fresh weights for local inference without safety guardrails.

The base model, noahoksuz/holo-3.1-4b-uncensored-heretic, was quantized by mradermacher into standard and i1 (importance-matrix) GGUF formats on June 9. Both repos carry Apache 2.0 licenses and are tagged for English-language transformer inference. The model cards showed zero downloads and zero likes at publication, which is consistent with the quants being fresh uploads.

GGUF packaging makes the weights drop-in compatible with llama.cpp, Ollama, LM Studio, and other CPU/GPU inference runtimes. The standard GGUF repo offers a ladder of bit-depth quants, Q2_K through Q8_0, letting users trade file size against accuracy depending on available VRAM. The i1 variant uses an importance matrix, computed from calibration text, to steer quantization error away from the weights that matter most to the model's output. The technique leaves file size essentially unchanged at a given quant type and typically yields better perplexity, with the largest gains at lower bit depths such as Q3_K_S and below.
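Part of what makes GGUF drop-in compatible across runtimes is that every file starts with the same small fixed header. The sketch below reads that header, assuming the layout used by GGUF version 2 and later: a 4-byte magic "GGUF", a 32-bit version, then 64-bit tensor and metadata key-value counts, all little-endian. The names read_gguf_header and GGUFHeader are illustrative and not part of any library, and the demo values are synthetic rather than taken from the Holo quants.

```python
import io
import struct
from typing import BinaryIO, NamedTuple

GGUF_MAGIC = b"GGUF"
HEADER_FORMAT = "<4sIQQ"  # magic, version (u32), tensor count (u64), KV count (u64)
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def read_gguf_header(stream: BinaryIO) -> GGUFHeader:
    """Read the fixed-size header from the start of a GGUF (v2+) stream."""
    raw = stream.read(HEADER_SIZE)
    if len(raw) < HEADER_SIZE:
        raise ValueError("stream is too short to contain a GGUF header")
    magic, version, tensor_count, kv_count = struct.unpack(HEADER_FORMAT, raw)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic was {magic!r})")
    if version < 2:
        # Assumption: version 1 used 32-bit counts, which this sketch does not parse.
        raise ValueError(f"unsupported GGUF version {version}")
    return GGUFHeader(version, tensor_count, kv_count)


# Demo on a synthetic header; the counts here are made up for illustration.
fake_file = io.BytesIO(struct.pack(HEADER_FORMAT, GGUF_MAGIC, 3, 100, 20))
print(read_gguf_header(fake_file))
```

To check a downloaded quant, the same function can be called on a file opened in binary mode, for example with open(path, "rb"). A file that fails the magic check is not GGUF at all, which is a quick way to catch a truncated or mislabeled download before handing it to a runtime.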

Both repos are tagged endpoints_compatible, which indicates they can be deployed on Hugging Face Inference Endpoints, the dedicated hosting service where a user spins up a private endpoint, as opposed to the shared serverless inference API. The tag says little about how most people will actually run the files: Transformers can load some GGUF checkpoints by dequantizing them, but GGUF weights are far more commonly loaded through llama.cpp bindings than through pure PyTorch.

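The "uncensored-heretic" suffix signals that safety tuning has been stripped or was never applied. In the open-weight ecosystem, "uncensored" typically means that refusal behaviors have been removed, either by continued fine-tuning on unfiltered data or by directly ablating the refusal direction in the model's activations (often called abliteration), or that the base model was never aligned with safety instructions in the first place. "Heretic" is a less common tag that usually denotes a model tuned outside mainstream alignment norms. It is also the name of an open-source tool for automated abliteration, so the suffix may indicate that the tool was used here.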

At 4 billion parameters, Holo-3.1 sits in the same weight class as Phi-3-mini (3.8B) and Qwen3-4B, well below 7B-class models such as Qwen2-7B. Quantized to Q4 or Q5, the weights come to roughly 2.5 to 3 GB, so the model fits comfortably on consumer GPUs with 8 to 12 GB of VRAM with room left for the KV cache, yet it is large enough to handle multi-turn chat and basic reasoning tasks. The Apache 2.0 license permits commercial use without royalty, a key factor for developers building products on top of open weights.
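The size-versus-VRAM tradeoff can be checked with back-of-the-envelope arithmetic: weight storage is roughly parameter count times bits per weight, divided by eight. The sketch below is a rough estimate under stated assumptions, not a measurement of the actual repo files. It assumes exactly 4 billion parameters and uses the nominal bits per weight of each llama.cpp block format. Real quant mixes such as Q4_K_M keep some tensors at higher precision, so published files tend to be somewhat larger, and the KV cache and runtime buffers come on top of the weights.

```python
# Nominal bits per weight for individual llama.cpp block formats (block
# payload plus per-block scales). Treat these as approximations: real
# "_S"/"_M" mixes blend several formats across tensors.
NOMINAL_BITS_PER_WEIGHT = {
    "Q2_K": 2.625,
    "Q4_K": 4.5,
    "Q6_K": 6.5625,
    "Q8_0": 8.5,
}

ASSUMED_PARAMS = 4e9  # assumption: "4B" taken as exactly 4 billion weights


def estimate_weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk weight size in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


for quant, bpw in NOMINAL_BITS_PER_WEIGHT.items():
    size_gb = estimate_weights_gb(ASSUMED_PARAMS, bpw)
    print(f"{quant}: ~{size_gb:.2f} GB of weights")
```

Even the Q8_0 estimate stays well under 8 GB, which is why a 4B model is a comfortable fit for mid-range consumer cards at any rung of the ladder; the lower quants matter mainly for CPU-only machines or very long contexts.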
