Magnum v4 22B is the open-weights roleplay LLM the r/SillyTavern crowd settled on as the 24 GB-class default through 2025 and into 2026. Built by anthracite-org on top of Mistral Small 24B Instruct, it is fine-tuned hard for character voice consistency, long-form prose, and explicit content with no remote moderation and no jailbreak step. If you have a 4090 or 3090 and take roleplay seriously, this is the model people buy those cards to run.
What Magnum v4 22B Is
Magnum v4 22B is a fine-tune of Mistral Small 24B Instruct released by anthracite-org in late 2024 and refined through 2025: a dense 22-billion-parameter LLM with 32K native context, distributed as open weights on Hugging Face under Apache 2.0. No API, no remote inference, no content filter. Download the GGUF, load it into KoboldCpp or LM Studio, and the model answers what you ask.
The author matters. anthracite-org is not a single person like Sao10K (Stheno, Fimbulvetr) or TheDrummer (Rocinante). It is an open collective of RP-tuners and dataset curators collaborating on a shared training pipeline. Decisions about dataset composition, hyperparameters, and base-model targets get made in the open, with release notes on what changed. The rhythm is slower and more deliberate, and the optimization target is different — anthracite trades raw assistant capability for prose quality and character-voice consistency on purpose.
The "v4" designation is meaningful. v1 was on Qwen 2 72B, v2 split across Qwen and Llama 3, v3 added Mistral Large variants. Each generation was as much about which base anthracite trusted as about training improvements. With v4 the collective standardized on Mistral Small at the prosumer tier and on Qwen 2.5 / Llama 3.3 70B at the higher tier. The 22B variant is what you run on a single 24 GB consumer card, and the version that earned the reputation.
The anthracite-org Training Philosophy
Three things make the anthracite pipeline distinctive.
First, dataset openness. Most RP-tuners do not publish what is in their training set. anthracite consistently publishes the corpus composition — proportions of fanfiction, roleplay logs, character-driven literary fiction, instruction data, synthetic conversations from larger models. When something changes between versions, they say what changed. That openness is rare enough in this corner of the open-weights world that it functions as a brand differentiator and as a tool for users to reason about what the model will do.
Second, voice consistency over assistant capability. Most instruction tuning aims to make a model a better assistant — code, math, structured-instruction following. anthracite explicitly does not optimize for that. Training data is weighted toward long-form character interaction, and the loss is weighted toward keeping a character in voice across many turns. Magnum v4 holds a personality over a 20K-token session in a way base Mistral Small does not — and it will be slightly worse than the base at writing a Python script. The trade is intentional.
Third, retaining the base's lightly-aligned behavior rather than aggressively abliterating it. Some RP-tuners apply abliteration or alignment-stripping DPO. anthracite's approach is gentler — they start from Mistral Small Instruct, which is already lightly aligned, and fine-tune on data that simply does not include refusals. The model learns that in this distribution requests get answered; anthracite never has to forcibly excise a refusal pattern. The result feels more natural than abliterated models, which sometimes have a flatness where the alignment used to be.
The Mistral Small 24B Base — Why anthracite Picked It
Picking the right base is half of fine-tuning. The choice of Mistral Small 24B for v4 was deliberate.
Apache 2.0. Mistral Small 24B uses the same permissive license Mistral applies to most of their open-weights work. A Magnum fine-tune inherits a license that permits commercial use, redistribution, and modification with no royalty obligation. Llama 3 derivatives are bound by Meta's community license with acceptable-use restrictions and a 700-million-MAU commercial-use threshold. Apache 2.0 is cleaner.
Clean instruct tuning. Mistral's instruct models have an unusually clean training curve — they follow instructions without heavy-handed refusal training. Fine-tunes inherit that cleanness; anthracite does not have to fight base alignment to get reasonable behavior.
The 24B sweet spot. The prosumer bracket settled on 24 GB VRAM as the consumer ceiling: RTX 3090, 4090, and increasingly 5090 owners run quantized 22-24B dense models comfortably with full context. At Q4_K_M, a 24B dense model lands at ~14 GB — enough headroom for 32K KV cache plus OS overhead. Smaller (8B, 12B) models do not meaningfully use the card; larger (70B) models force aggressive quantization or CPU offload. 22-24B is where smart-per-dollar peaks on a single card.
Format continuity. Mistral Small's instruct template is [INST]...[/INST], and Magnum v4 retains it. Using the wrong format costs noticeable quality.
Prompt Format — Mistral Instruct, Not ChatML
The single most common mistake new users make is sending Magnum ChatML. ChatML — <|im_start|>system etc. — is the format of the Dolphin line, OpenChat, Hermes, and most Qwen fine-tunes. Magnum was not trained on it. The model will respond to ChatML prompts (it is forgiving), but you lose ~10-20% of its character-consistency capability — the thing you downloaded the model for.
The correct format is Mistral instruct:
```
[INST] You are Sera, a tavern bard with a sharp tongue and a soft spot for travelers with stories. Stay in character. Reply only as Sera.

A traveler walks into the tavern, dust on his boots. [/INST]
```
System prompt and user message both go inside the same [INST]...[/INST] block. The reply continues after the closing tag with no opening tag of its own. Subsequent user turns get their own blocks.
A character-card setup:
```
[INST] {{char}} is Marin, a 27-year-old marine biologist on a research vessel in the South Pacific. She is dry-witted, deeply curious about everything underwater, and unguardedly direct. She has a pet hermit crab named Theodore.

Roleplay as Marin in first person. Stay in character. Use sensory detail.

{{user}} sits down across from Marin in the ship's mess hall, holding two mugs of coffee. [/INST]
```
For long-context continuation, history is a sequence of [INST]...[/INST] blocks separated by previous replies, with </s> after each model turn:
```
[INST] {setup} {{user}}: First message [/INST] First reply.</s> [INST] {{user}}: Second message [/INST]
```
In SillyTavern, select the Mistral instruct preset under Advanced Formatting — ST assembles the format correctly. KoboldCpp's Instruct mode with the Mistral template does the same. If you call llama.cpp's API directly, your client code is responsible for the format, and this is where mistakes happen.
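If you do talk to the API directly, the assembly logic is small enough to get right once and reuse. A minimal Python sketch — the function and its names are illustrative, not from any library:

```python
def build_mistral_prompt(setup: str, history: list[tuple[str, str]], user_msg: str) -> str:
    """Assemble a Mistral-instruct prompt: the setup (system text) shares the
    first [INST] block with the first user message, each completed model
    reply is closed with </s>, and the final block is left open for the model."""
    parts = []
    for i, (user_turn, reply) in enumerate(history):
        prefix = f"{setup}\n\n" if i == 0 else ""
        parts.append(f"[INST] {prefix}{user_turn} [/INST] {reply}</s>")
    prefix = f"{setup}\n\n" if not history else ""
    parts.append(f"[INST] {prefix}{user_msg} [/INST]")
    # Most loaders (llama.cpp included) prepend the <s> BOS token themselves,
    # so the prompt string starts at the first [INST].
    return "".join(parts)
```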
Hardware: 14 GB at Q4_K_M, Comfortable on 24 GB
Quantized footprint scales linearly with bits-per-weight.
| Quantization | File size | Quality | Min VRAM (full offload + 32K context) |
|---|---|---|---|
| Q3_K_M | ~10 GB | Noticeable degradation | 12 GB tight |
| Q4_K_S | ~12 GB | Small loss | 16 GB comfortable |
| Q4_K_M | ~14 GB | Minimal loss | 20 GB comfortable |
| Q5_K_M | ~16 GB | Negligible | 24 GB comfortable |
| Q6_K | ~18 GB | Imperceptible | 24 GB tight at full context |
| Q8_0 | ~24 GB | Reference | 32 GB |
On a 24 GB card, the question is Q5_K_M versus Q6_K. Q5_K_M leaves comfortable headroom for full 32K context plus KV cache; Q6_K is tighter and may force you to reduce context to ~24K. Both are quality-indistinguishable from full precision in blind A/B tests on RP prose. Pick Q5_K_M unless you have a specific reason to go higher. For 16 GB cards, Q4_K_M is the right choice — small perplexity loss versus Q5_K_M, no perceptible prose-quality difference. Q4_K_S on a 12 GB card is possible but tight; you will reduce context.
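The table's numbers are just arithmetic: parameter count times bits-per-weight for the file, plus the KV cache on top. A back-of-envelope sketch in Python — the bits-per-weight figure and the GQA shape below are rough assumptions for a 22B-class model, not values from the model card:

```python
def gguf_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Rough quantized file size in GB: parameter count (billions) x bits per weight."""
    return params_b * bits_per_weight / 8

def kv_cache_gb(ctx: int, layers: int, kv_heads: int, head_dim: int) -> float:
    """FP16 KV cache in GB: two tensors (K and V) per layer per token, 2 bytes each."""
    return 2 * layers * kv_heads * head_dim * 2 * ctx / 1e9

# Q4_K_M averages roughly 4.8 bits per weight across tensors (approximate)
print(gguf_size_gb(22, 4.8))           # ~13 GB, in the neighborhood of the table's ~14 GB
# Assumed GQA shape: 56 layers, 8 KV heads, head dim 128
print(kv_cache_gb(32768, 56, 8, 128))  # ~7.5 GB of KV cache at full 32K context
```

Flash Attention and, where your build supports it, quantized KV cache pull the second number down, which is how the tighter rows in the table stay workable.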
Sampler starting point (a request sketch follows the list):
- Temperature: 0.9 to 1.05. Lower (~0.7) produces noticeably more repetitive prose.
- Min-P: 0.05. The modern replacement for top-P (nucleus) sampling; works better with Magnum.
- Top-K: 0 or 40. Mostly cosmetic on this model.
- Repetition penalty: 1.05 to 1.10. Be conservative.
- DRY sampler: multiplier 0.8, base 1.75, allowed length 2. The single biggest quality lever on long-context RP — stops phrase loops without flattening creativity.
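Wired into an actual request, those settings look like this. A hedged Python sketch against KoboldCpp's KoboldAI-compatible generate route on its default port — the field names match KoboldCpp's API as I understand recent builds, but verify against your build's /api docs:

```python
import json
import urllib.request

payload = {
    "prompt": "[INST] ... [/INST]",  # assembled as shown earlier
    "max_length": 350,               # tokens to generate
    "temperature": 0.95,
    "min_p": 0.05,
    "top_k": 0,
    "rep_pen": 1.05,
    # DRY sampler: the big lever against long-context phrase loops
    "dry_multiplier": 0.8,
    "dry_base": 1.75,
    "dry_allowed_length": 2,
}
req = urllib.request.Request(
    "http://127.0.0.1:5001/api/v1/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["results"][0]["text"])
```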
Context length is 32K native. Prose quality degrades subtly past ~24K — the model remembers what happened, but voice consistency loosens slightly. If your sessions consistently live in 24-32K, this is when you start thinking about a 70B.
Where Magnum v4 22B Wins
Character voice that holds across long sessions. The headline feature. Set up a character card with distinctive speech patterns and Magnum will keep them intact across a 20K-token session in a way base Mistral Small Instruct visibly will not. In group chats, characters retain their own voices and do not bleed into one another.
Prose quality noticeably above the base. Sentence rhythm, paragraph structure, and sensory description are measurably better than vanilla Mistral Small Instruct. The training corpus weighting toward literary fiction shows up here. The model writes prose that reads like prose, not like a chat agent attempting prose.
Strong at long romantic narrative, group chats, internal monologue. Specific genres where Magnum is comfortable: slow-burn romantic arcs that need to develop tension over many turns, group chats with three or four distinct characters present, first-person internal monologue that needs to feel introspective rather than expository.
Native NSFW with no jailbreak. anthracite's training set includes explicit content as part of the literary fiction distribution without alignment patching. No specific prompt or jailbreak required — explicit content arises when the scene calls for it, treated with the same prose register as any other content. No awkward register shift when a scene becomes explicit.
Where It Falls Short
Magnum is not a generalist; honest framing matters.
Not as smart as a dense 70B for raw reasoning. Give Magnum a logic puzzle or multi-step math problem and it does worse than base Mistral Small at the same task and significantly worse than a 70B. The fine-tune trades assistant capability for prose voice — by design, but it does mean Magnum is the wrong tool for non-RP work.
No native vision or tool-use. Mistral Small 24B has no vision encoder and no native tool-use training; Magnum inherits both gaps.
Slight repetition past 25K context with low temperature. Below ~0.85, longer sessions develop repetitive sentence structures. Raise temperature into 0.95-1.05 and enable DRY — the issue largely disappears.
Wrong tool for code or general assistant work. Use Mistral Small Instruct directly, or Dolphin Llama 3 70B for de-aligned generalist capability.
How It Compares To Other 12B-22B RP Picks
vs Rocinante v1.1 12B (TheDrummer). The smaller-VRAM RP default — Mistral Nemo 12B base, 33K context, comfortably on 12 GB at Q5. TheDrummer's tuning is more aggressive: louder, hornier, more willing to push scenes without prompting. Less precise about character voice over long sessions. On 12 GB, Rocinante is right. With 24 GB, Magnum is the more refined experience.
vs L3-Stheno 3.3 8B (Sao10K). The 8 GB-tier default — Llama 3 8B base, 8K native context (extendable to 32K with quality drop), runs at Q4 on 12 GB or Q5 on 16 GB. Sao10K's tuning emphasizes character expressiveness over long-context voice. The 8B parameter count caps narrative complexity — group chats with four characters get harder than on Magnum. If your card cannot run a 22B, Stheno is excellent. If it can, the jump to Magnum is one of the biggest quality jumps in open-weights RP.
vs MythoMax L2 13B (Gryphe). The historical reference — Llama 2 13B in 2023, only 4K native context. It established the prose register the RP-tuning scene now references. It is also dated: 4K context is unusable for modern long-form RP, the Llama 2 base is weaker than anything from 2024 onward, and the distinctive 2023 voice has been matched and exceeded by a dozen newer fine-tunes. MythoMax matters as cultural history, not as a current pick.
vs Fimbulvetr v2 11B (Sao10K). The alternate-prose-register pick — Solar 10.7B base (a Mistral 7B variant with depth upscaling), known for a darker, more literary voice than the Llama 3 line. Runs comfortably on 12 GB at Q5. The prose has a distinct character — more melancholic, more literary — that some readers prefer for specific genres. A stylistic choice, not a capability upgrade. If Magnum's voice feels too modern or too clean, Fimbulvetr is worth trying.
When You Outgrow Magnum v4 22B
There is a real upgrade path, and the tradeoff is worth understanding.
The next tier is a dense 70B — practically 48 GB of VRAM, either two 24 GB cards in tensor parallelism or a single 48 GB workstation card (A6000, RTX 6000 Ada); a 32 GB 5090 only gets there with aggressive quantization. The de-aligned 70B with the most consistent reputation in the RP scene is Dolphin 2.9 Llama 3 70B from cognitivecomputations. It is smarter than Magnum in raw capability — better at multi-step reasoning, at maintaining facts about a complex world, and at following nuanced instructions. The Dolphin methodology means it does not refuse.
It is not strictly an upgrade. Dolphin Llama 3 70B is a generalist trained for de-aligned assistant capability. Its prose voice is good — much better than vanilla Llama 3 70B Instruct — but it lacks the character-voice-consistency optimization anthracite built into Magnum. Side-by-side, Dolphin out-thinks Magnum on complex narrative logic; Magnum out-writes Dolphin on long-form character-driven scenes.
If your sessions hit reasoning failures (forgetting world facts, missing implications, losing track of who said what in a multi-character scene), the 70B helps. If they are working fine but you want a more literary or consistent voice, the 70B will not give you that — Magnum already has it.
How To Run Magnum v4 22B
The model is on Hugging Face under anthracite-org, with GGUFs published by community uploaders (most commonly bartowski and mradermacher). Practical loaders, in rough order of RP-community preference:
KoboldCpp. The RP default. Drop the GGUF in, set prompt template to Mistral, context to 32768, enable Flash Attention if supported, connect SillyTavern to its API. Defaults are tuned for RP, and the loader has the most RP-specific features (lorebooks, world info, sampler controls, DRY). What most r/SillyTavern users run.
LM Studio. The friendliest GUI. Search "magnum v4 22b" in LM Studio, pick a Q5_K_M GGUF, hit download. Prompt template detection is automatic for Mistral-based models. Good for conversational testing and prompt iteration.
Ollama. ollama pull magnum-v4-22b if a community-maintained tag exists, or load a GGUF via Modelfile. Convenient for application integration, but sampler control is less granular than KoboldCpp's.
text-generation-webui (oobabooga). The kitchen-sink option. Supports GGUF, EXL2, AWQ, and GPTQ; integrates with SillyTavern via OpenAI-compatible API.
The frontend, regardless of loader, is SillyTavern. ST is the de facto RP frontend for the open-weights world — character cards, lorebooks, group chats, persona cards, history management at a maturity none of the built-in chat UIs come close to. Set the API connection to your loader's endpoint, select the Mistral instruct preset under Advanced Formatting, and you are ready to roleplay.
Quick-start with KoboldCpp and ST installed: download magnum-v4-22b-Q5_K_M.gguf from bartowski's HF repo, launch KoboldCpp with --contextsize 32768 --gpulayers -1, connect SillyTavern's KoboldCpp API to http://127.0.0.1:5001.
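Before pointing SillyTavern at the endpoint, it is worth confirming the model actually loaded. A minimal Python sketch, assuming KoboldCpp's default port and its standard KoboldAI-compatible model-info route:

```python
import json
import urllib.request

# KoboldCpp reports the loaded model's name on this KoboldAI-compatible route
with urllib.request.urlopen("http://127.0.0.1:5001/api/v1/model") as resp:
    print(json.load(resp)["result"])  # expect something like "koboldcpp/magnum-v4-22b-Q5_K_M"
```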
License — Open Weights, Apache 2.0 Base
Mistral Small 24B is released under Apache 2.0. anthracite's fine-tune inherits that license — Magnum v4 22B is Apache 2.0 by extension, as the HF model card states. That permits commercial use, modification, redistribution, and bundling into derivative works with no royalty obligation. Generated content is yours to use commercially, including in published fiction, games, or any other product.
The training-data philosophy matters too. anthracite publishes the corpus composition, which is rare in the RP-tuning space. That openness lets you reason about whether the corpus aligns with the content you want to generate, and about edge cases (literary register, how it handles specific genres, why its prose feels the way it feels). Most RP fine-tunes are black-box on training data; anthracite's choice to be open is the kind of practice that should be more common.
No remote moderation, no content filter, no API gate, no MAU threshold. You download the weights and run them on your hardware.
When To Use Magnum v4 22B And When To Reach For Something Else
Short decision matrix.
24 GB GPU + serious long-form RP — Magnum v4 22B at Q5_K_M. The intended bracket. Nothing in the same VRAM tier matches it for character voice consistency.
12 GB GPU + RP — Rocinante v1.1 12B at Q5_K_M, or L3-Stheno 3.3 8B at Q5 for headroom. Rocinante is louder; Stheno is more refined but smaller. Neither matches Magnum's voice consistency, but both are excellent within their bracket.
General assistant + code work — Mistral Small Instruct directly, no fine-tune. If you need de-aligned generalist capability, Dolphin Llama 3 70B at 48 GB, or Dolphin Mixtral 8x7B for MoE speed at lower VRAM.
Multilingual RP (Russian, Japanese, Chinese) — Magnum is English-dominant. anthracite's corpus is predominantly English literary fiction. Dolphin Mixtral 8x7B has Mixtral's solid French/German/Spanish/Italian with usable Russian and Japanese, or DeepSeek V3 if you have the hardware. Magnum will respond to non-English prompts, but prose quality drops sharply outside English.
48+ GB GPU + RP — Dolphin Llama 3 70B for raw smarts, or Magnum v4 72B (anthracite's larger variant on Qwen 2.5 72B / Llama 3.3 70B) for anthracite's voice optimization at scale.