Gemma 4 Ortenzya 31B uncensored multimodal model debuts on HuggingFace
llmfan46 released an uncensored Gemma 4-based 31B-parameter image-text-to-text model on HuggingFace, optimized for creative writing with both image and text input capability.
An uncensored 31-billion-parameter multimodal model built on Gemma 4 architecture appeared on HuggingFace this week, positioning itself as a creative-writing tool that accepts both image and text inputs.
Gemma 4 Ortenzya The Creative Wordsmith 31B, released by llmfan46 on May 18, carries the "heretic" and "uncensored" tags that signal removal of safety filters. The model uses the image-text-to-text pipeline, meaning it can process visual prompts alongside natural language and generate text responses. Weights are distributed in safetensors format and the card notes Unsloth optimization, a framework that speeds fine-tuning on consumer GPUs by reducing memory overhead and training time. The model is compatible with text-generation-inference serving stacks, making it straightforward to deploy behind an API endpoint for local or private use.
At 31 billion parameters the model requires 48–64 GB of VRAM for full-precision inference, putting it out of reach for single consumer GPUs but accessible on dual-RTX-4090 or professional setups. Four-bit quantization could bring memory requirements down to roughly 16–20 GB, opening the door to single-GPU deployment on high-end consumer hardware. The model card does not publish benchmark scores, example outputs, or training dataset details.
The release follows the broader pattern of community fine-tunes that strip alignment layers from base models to serve niche or unrestricted use cases. Gemma 4, Google's latest open-weight family, ships with built-in safety tuning that blocks certain prompts and outputs. Abliterated or "heretic" variants remove those guardrails, a practice that has become standard in the open-source scene for practitioners who need full control over model behavior—whether for creative fiction, red-teaming, or research into adversarial prompts. Multimodal models combining vision and language have seen rapid iteration in the past year, with LLaVA, Qwen-VL, and Idefics establishing the image-text-to-text pattern. Adding uncensored weights to that mix gives local users the same multimodal capabilities without server-side content filtering.
