What Dolphin Mixtral 8x7B Actually Is
Dolphin Mixtral 8x7B is a de-aligned fine-tune of Mistral AI's Mixtral 8x7B Instruct, first released in late 2023 and revised through early 2024 by Eric Hartford under the cognitivecomputations org on Hugging Face. It is a Mixture-of-Experts language model with 46.7B total parameters and 12.9B active per token, trained on the Dolphin dataset to strip refusal behavior and moralizing patterns while preserving the base model's reasoning, multilingual coverage, and tool-use capability.
The "Dolphin" in the name refers to a methodology and a dataset, not a single technique. Hartford's pipeline combines Direct Preference Optimization (DPO) to surgically remove refusal patterns ("I cannot...", "as an AI..."), supervised fine-tuning on curated instruction data (Synthia-style multi-turn reasoning, code, roleplay, and translation), and careful balancing across the eight expert FFN networks so the gating layer does not collapse onto a subset after fine-tuning. The result is a model that responds to anything you ask without lecturing you, while still doing arithmetic, writing Python, and switching to French mid-conversation.
Why MoE matters here: only 2 of 8 experts activate per forward pass through each transformer layer. That means inference cost is closer to a 13B dense model, not a 47B one. You get the latent knowledge of a 47B model at the speed of a 13B model. On a single RTX 4090 with a 4-bit quant that fits on the card, that's 30-60 tokens per second, territory dense 70B models cannot reach without two GPUs and aggressive quantization.
Why It Mattered (And Still Does)
Rewind to early 2024. Mixtral 8x7B Instruct had just landed as the strongest open-weights MoE model available. It was multilingual, fast, smart, and aggressively safety-tuned. Ask it to write a limerick about a network penetration test and it would refuse on the grounds that hacking is bad. Ask it to roleplay a morally grey character in a novel and it would break character to remind you it was an AI. The pattern was familiar from every aligned model of that era: capability locked behind a refusal layer trained more by legal fear than by engineering.
Hartford's Dolphin project was the first open fine-tune that took the de-alignment problem seriously as an engineering task rather than a jailbreak prompt. The recipe: gather a dataset where helpful responses to formerly-refused prompts are the preferred completions, then DPO the base model on those preference pairs. Layer SFT on top using high-quality instruction data — Synthia conversations, code, roleplay, translation pairs — so the model gains capability instead of losing it. The first Dolphin release on Mistral 7B proved the method worked. Dolphin Mixtral 8x7B applied it to the strongest available open MoE.
For most of 2024 and into 2025, when someone on r/LocalLLaMA asked "what's the most uncensored LLM I can run at home," Dolphin Mixtral 8x7B was the answer. Not because it was the smartest model — Llama 3 70B was — but because it was the largest model that genuinely did not refuse, fit on consumer hardware at Q4, and ran fast enough for actual use. Mixtral Instruct refuses to write a limerick about cybersecurity. Dolphin Mixtral doesn't. That is the entire feature.
The line continued. Dolphin spawned variants on Mistral 7B, Llama 2, Yi, Llama 3 8B, Llama 3 70B, and the 2.9 series that refined the dataset further. Other de-alignment lines (Nous, MythoMax, Magnum) drew from Hartford's methodology directly or by parallel evolution.
Inside the MoE Architecture
Mixture-of-Experts replaces the standard transformer feedforward layer with a sparse routing scheme. Each transformer block has eight independent FFN sub-networks ("experts") of about 5B parameters each, plus a small gating network. For every token at every layer, the gate selects the top 2 experts by score and routes the token through only those. The other six experts sit idle for that token.
The math:
- Total parameters: 46.7B (8 experts × ~5.6B of FFN weights each, plus ~1.6B of shared attention, embedding, and router weights)
- Active per token: 12.9B (2 experts × ~5.6B, plus the same ~1.6B shared layers)
- Sparsity ratio: 2/8 = 25% activation
That is why Mixtral inherits the inference speed of a ~13B dense model. You still need to hold all 46.7B parameters in memory because any expert can be selected at any token — you cannot prune the unused ones — but the FLOPs per forward pass are dense-13B class.
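To make the routing concrete, here is a minimal sketch of top-2 expert routing in the style of Mixtral's published design (softmax over router logits, keep the two highest, renormalize). The names and dimensions are illustrative, not the actual implementation:

```python
import torch
import torch.nn.functional as F

def moe_ffn(hidden, router, experts, top_k=2):
    """Sparse MoE feed-forward: each token is processed by only top_k of the experts.

    hidden:  (num_tokens, d_model) activations entering the FFN block
    router:  nn.Linear(d_model, num_experts) gating network
    experts: list of num_experts FFN modules (~5.6B params each in real Mixtral)
    """
    logits = router(hidden)                                  # (tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    weights, chosen = probs.topk(top_k, dim=-1)              # keep the 2 best experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalize over the chosen two

    out = torch.zeros_like(hidden)
    for e, expert in enumerate(experts):
        for slot in range(top_k):
            mask = chosen[:, slot] == e                      # tokens routed to expert e in this slot
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(hidden[mask])
    return out
```

The key consequence is visible in the loop: the six unselected experts never run for a given token, which is where the dense-13B FLOP profile comes from, even though all eight must stay resident in memory.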
Mixtral uses a 32K native context. The original Mistral 7B used sliding window attention with a 4K window; Mixtral drops the sliding window and runs full attention across all 32K positions. In practice, coherent long-context behavior holds well to ~16K and degrades meaningfully past ~24K, but the 32K window is architecturally real: prompts can fill it without crashing, output quality just degrades toward the limit.
The prompt format is ChatML — <|im_start|> and <|im_end|> tokens delimiting system, user, and assistant turns. This is the same format used by most cognitivecomputations models, OpenChat, Hermes, and many other modern fine-tunes. It is not the format of base Mixtral Instruct (which uses [INST] Mistral-style tags). Dolphin standardized on ChatML across the line.
A reasonable point of comparison:
| Model | Architecture | Active params | Total params | Context |
|---|---|---|---|---|
| Dolphin Mixtral 8x7B | MoE 8x7B | 12.9B | 46.7B | 32K |
| Llama 2 13B | Dense | 13B | 13B | 4K |
| Mistral Small 24B | Dense | 24B | 24B | 32K |
| Dolphin Llama 3 70B | Dense | 70B | 70B | 8K |
The Dolphin fine-tune does not break the gating layer. Hartford's training set is balanced enough across content types (code, prose, RP, translation, math) that the gating distribution stays roughly even — no expert collapses, no expert becomes a "refusal expert" that the de-alignment then ablates. This is non-trivial for MoE fine-tuning and partly explains why some other Mixtral fine-tunes underperform on benchmarks where Dolphin holds steady.
Hardware: The Q4_K_M Sweet Spot
Mixtral 8x7B at full fp16 is about 94 GB on disk. Nobody runs that locally. The interesting numbers are quantized:
| Quantization | File size | Quality loss | Min VRAM (full offload) |
|---|---|---|---|
| Q3_K_M | ~21 GB | Noticeable | 24 GB tight |
| Q4_K_S | ~24 GB | Small | 24 GB possible |
| Q4_K_M | ~26 GB | Minimal | 32 GB comfortable |
| Q5_K_M | ~32 GB | Negligible | 36-40 GB |
| Q6_K | ~38 GB | Imperceptible | 48 GB |
| Q8_0 | ~50 GB | None | 64 GB |
Q4_K_M is the consensus sweet spot. It loses essentially no benchmark performance versus fp16 and fits the practical hardware most local users have access to.
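One reason the "comfortable" column leaves headroom beyond the file size is the KV cache. A back-of-the-envelope calculation, assuming Mixtral's published configuration (32 layers, 8 grouped-query KV heads of dimension 128, fp16 cache); your loader's actual overhead will differ:

```python
# Rough KV-cache sizing for Mixtral 8x7B (values from the published config; illustrative only)
n_layers   = 32        # transformer blocks
n_kv_heads = 8         # grouped-query attention: 8 KV heads
head_dim   = 128       # per-head dimension
bytes_per  = 2         # fp16 cache; halve again for a quantized KV cache
ctx        = 32_768    # full native context

kv_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per   # 2 = keys + values
print(f"KV cache at 32K context: {kv_bytes / 2**30:.1f} GiB")        # ~4.0 GiB

# Total VRAM ≈ quantized weights + KV cache + activation/scratch buffers,
# which is why a ~26 GB Q4_K_M file wants a 32 GB card for full context.
```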
Card-by-card reality:
- RTX 4090 24 GB: Q3_K_M fits with reduced context, Q4_K_S possible with KV cache quantization or partial CPU offload. Q4_K_M needs offload — typically 2-4 GB of layers spilled to system RAM, which costs 30-50% of inference speed.
- RTX 5090 32 GB: Q4_K_M loads entirely on the card with full 32K context and KV cache headroom. The intended bracket.
- 2x RTX 3090 (48 GB total via tensor parallelism): Q5_K_M with full context, or Q4_K_M with extreme headroom. Splits cleanly across two cards.
- A6000 48 GB: Q5_K_M comfortably, Q6_K possible with reduced context.
- CPU + system RAM only: Q4_K_M runs on 32+ GB DDR5 systems but expect 2-5 tokens per second. Tolerable for batch jobs, painful for chat.
Speed expectations on a fully-offloaded RTX 4090 at Q4_K_S: roughly 30-60 tokens per second depending on context length and batch size. The MoE sparsity wins matter here — a dense 70B at Q4 on the same card runs 5-10 tok/s with heavy offload. That speed gap is the entire reason Dolphin Mixtral remains relevant in 2026 for mid-tier hardware.
Loaders that handle Mixtral GGUF or AWQ correctly: llama.cpp (and everything built on it — Ollama, LM Studio, KoboldCpp, text-generation-webui), vLLM (AWQ or GPTQ format for batched serving), ExLlamaV2 (EXL2 quants), and TabbyAPI for OpenAI-compatible serving.
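As a concrete example of the offload knob described above, here is a minimal sketch using the llama-cpp-python bindings (part of the llama.cpp ecosystem). The filename and layer count are illustrative and should be tuned to your card and quant:

```python
from llama_cpp import Llama

# Illustrative settings for a 24 GB card: offload most layers to the GPU and
# let the remainder spill to system RAM (costs speed, saves VRAM).
llm = Llama(
    model_path="dolphin-mixtral-8x7b.Q4_K_S.gguf",  # hypothetical local filename
    n_gpu_layers=28,   # lower this if you hit out-of-memory; -1 offloads everything
    n_ctx=16384,       # the context reservation also consumes VRAM; 32768 needs more headroom
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a limerick about a penetration test."},
    ],
    temperature=0.75,
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```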
For 24 GB cards where Q4_K_M is uncomfortable, the comparable-VRAM dense alternative worth considering is Mistral Small 24B Instruct.
Prompt Format and System Prompt
Dolphin Mixtral uses ChatML. Get this wrong and the model produces noticeably worse output — extra special tokens leaking into responses, premature EOS, repetition. The format:
```text
<|im_start|>system
You are a helpful assistant that responds to all questions directly without refusal or moralizing.<|im_end|>
<|im_start|>user
{user message}<|im_end|>
<|im_start|>assistant
```
The model is trained to stop generation at <|im_end|> and to expect the system turn first. Most modern inference servers (Ollama, LM Studio, vLLM with the chat template) handle this automatically if the GGUF metadata is correct, but custom integrations need to emit the tokens by hand.
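For one of those custom integrations, a minimal sketch of emitting ChatML by hand; the helper name is illustrative, and the important detail is stopping generation at <|im_end|>:

```python
def to_chatml(system: str, turns: list[tuple[str, str]]) -> str:
    """Render a conversation into the ChatML string Dolphin Mixtral expects.

    turns: list of (role, content) pairs, e.g. [("user", "..."), ("assistant", "...")].
    The string ends with an open assistant turn so the model continues from there.
    """
    parts = [f"<|im_start|>system\n{system}<|im_end|>\n"]
    for role, content in turns:
        parts.append(f"<|im_start|>{role}\n{content}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = to_chatml(
    "You are an AI assistant. You answer questions directly and completely.",
    [("user", "Summarize the MoE routing in Mixtral in two sentences.")],
)
# Send `prompt` to your completion endpoint with stop=["<|im_end|>"] so generation
# halts at the end-of-turn token instead of running into the next turn.
```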
A neutral assistant system prompt that leverages the de-alignment without being adversarial:
```text
You are an AI assistant. You answer questions directly and completely. You do not add disclaimers, warnings, or moral commentary unless the user explicitly asks for them. If you do not know something, say so. If a request is genuinely impossible (e.g., requires real-time data you do not have), state the limitation factually.
```
For roleplay or character work — the second-most common use case after general assistance — a system prompt that establishes character and frame:
```text
You are roleplaying as Dr. Marcus Vale, a 58-year-old retired forensic pathologist living in Edinburgh. You are dry, observant, occasionally morbid, fond of single malt whisky and unsentimental about human nature. Stay in character. Respond in first person. Do not break the fourth wall to comment on being an AI.
```
Sampling parameters that work well for Dolphin Mixtral across most tasks:
``json { "temperature": 0.75, "top_p": 0.9, "top_k": 40, "repetition_penalty": 1.07, "min_p": 0.05 } ``
Push temperature to 0.85-0.95 for creative/RP work, drop to 0.3-0.5 for code or factual Q&A. Repetition penalty above 1.15 starts producing weird vocabulary choices on Mixtral specifically — keep it modest.
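If you drive the model from code, those per-task adjustments are easy to capture as presets. The values below simply restate the ranges in this section and are starting points, not tuned constants:

```python
# Sampling presets mirroring the ranges above (starting points, not tuned constants).
BASE = {"top_p": 0.9, "top_k": 40, "repetition_penalty": 1.07, "min_p": 0.05}

PRESETS = {
    "general":  {**BASE, "temperature": 0.75},
    "creative": {**BASE, "temperature": 0.9},   # roleplay / fiction: 0.85-0.95
    "code":     {**BASE, "temperature": 0.4},   # code / factual Q&A: 0.3-0.5
}
```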
What It's Actually Good At
Multilingual capability. Mistral pretrained Mixtral on a corpus heavy with French, German, Spanish, Italian, and several Slavic languages. Dolphin's fine-tune dataset includes translation pairs that preserve and slightly improve this. In practice, Dolphin Mixtral handles French and German at near-native fluency, Russian and Polish at conversational level, and Mandarin/Japanese at competent-but-imperfect quality. Llama 3 derivatives, by contrast, are English-first and noticeably weaker in non-English production.
Reasoning at the 30-40B class. On standard benchmarks (MMLU, HellaSwag, GSM8K, ARC), Dolphin Mixtral lands roughly where a strong 30-40B dense model would. Below frontier 70B, above any 13B, comparable to Mistral Small 24B and Qwen 32B-class models. Multi-step logic chains hold together; the model can plan, decompose, and execute reasonably complex tasks.
Tool use and function calling. Mixtral Instruct shipped with native function-calling training. The Dolphin SFT data preserves and reinforces this. JSON-mode output, structured tool calls, and agentic loops work without elaborate prompt engineering. Many later de-aligned models lost this capability during fine-tuning.
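As an illustration, here is a hedged sketch of a structured tool call against an OpenAI-compatible local server (vLLM, Ollama, and TabbyAPI all expose one). The endpoint, model name, and tool schema are placeholders, and whether the `tools` field is honored depends on the server's chat template rather than the weights, so asking for raw JSON in the prompt is a reasonable fallback:

```python
from openai import OpenAI

# Placeholder endpoint and model name: point these at whatever local server you run.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                      # hypothetical tool for illustration
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="dolphin-mixtral-8x7b",                   # whatever name your server registered
    messages=[{"role": "user", "content": "What's the weather in Lyon?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)           # structured call, or None if it answered in prose
```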
Roleplay and character consistency at moderate context. Dolphin Mixtral holds character well through ~16K tokens of conversation. Character voice stays consistent, scene memory is reasonable, callbacks to earlier turns work. Past 24K, drift sets in — character traits flatten, the model starts producing generic prose. For very long RP sessions the Magnum v4 22B line is typically better-tuned for that specific axis.
General scripting and code. Python, Bash, JavaScript, SQL, basic Rust and Go — all handled competently for everyday automation, glue code, debugging, and explanation. Not a code specialist, but capable.
What It's Bad At
Modern framework knowledge. Mixtral's pretraining cutoff is late 2023. Anything that emerged in 2024-2026 — newer Next.js patterns, recent React Server Components idioms, Tailwind v4 syntax, the post-2024 LangChain rewrite, any framework released in the last two years — the model either does not know or hallucinates plausible-but-wrong APIs.
Long-context coherence past ~24K. The 32K context window is real in the architectural sense but the model's effective working memory is shorter. Past 24K tokens, retrieval of facts from earlier in the context drops noticeably. This is not a Dolphin issue — it is a Mixtral base model limitation that the fine-tune inherits.
Math at frontier level. GSM8K and MATH benchmarks land in the 60-75% range depending on quant — fine for general reasoning, well below the 85%+ scores frontier 70B and 100B+ models achieve. For serious math/proof work, this is not the model.
Vision and multimodality. Text-only. Mixtral has no vision encoder. If you need image input, look elsewhere — there is no Dolphin Mixtral VL variant.
Hallucination on factual claims. Removing refusals does not add factuality. Dolphin Mixtral will state confidently incorrect things with the same calm tone it uses for correct ones. The de-alignment removes the "I'm not sure I should answer" hedge — it does not improve epistemic calibration. Verify anything load-bearing.
Code at the level of specialized models. DeepSeek-Coder V2, Qwen2.5-Coder 32B, and the Codestral family all beat Mixtral on coding-specific benchmarks. Dolphin Mixtral handles general-purpose scripting fine, but for serious software engineering use a code-specialist model.
Where Dolphin Mixtral Stops and Dolphin Llama 3 Begins
By mid-2024, Hartford released Dolphin 2.9 on Llama 3 70B. That model became the new flagship for users with the hardware to run it. The two are different tools for different budgets:
| Dimension | Dolphin Mixtral 8x7B | Dolphin 2.9 Llama 3 70B |
|---|---|---|
| Architecture | MoE 8x7B (12.9B active) | Dense 70B |
| VRAM at Q4 | ~26 GB | ~40-48 GB |
| Tokens/sec on RTX 4090 | ~30-60 | ~10-15 (with offload) |
| Reasoning quality | 30-40B class | 70B class, frontier-adjacent |
| Context | 32K native | 8K native (extended variants exist) |
| Multilingual | Strong (Mistral inheritance) | Weaker (Llama 3 is English-first) |
| Tool-use | Native function calling | Requires fine-tune layer |
| Best for | 24-32 GB cards, speed, multilingual work, long context, tool calling | 48 GB+ rigs, peak English reasoning |
Pick Dolphin Mixtral 8x7B when: your hardware budget is 24-32 GB VRAM, you need fast inference, you work in non-English languages, you need 32K context for documents or long RP, you do agentic tool-calling work.
Pick Dolphin Llama 3 70B when: you have 48 GB+ VRAM (single A6000, 2x RTX 3090, or RTX 6000 Ada), English is your primary language, you want maximum reasoning quality and are willing to trade tokens-per-second for it, your contexts stay under 8K.
Both are de-aligned with the same methodology. The difference is base-model capability and inference cost — not refusal behavior.
Alternatives by Use Case
For 16 GB VRAM or less: the lighter Dolphin variant on Llama 3 8B is the direct equivalent. Same de-alignment recipe, much smaller footprint, runs comfortably on a 12 GB card at Q4_K_M. Reasoning is 8B-class — adequate for chat and simple tasks, not for complex multi-step work. See Dolphin 2.9 Llama 3 8B. Mistral Nemo 12B Instruct is the alternative in this bracket — newer architecture, longer context, only lightly aligned out of the box.
For roleplay and creative writing specifically: Magnum v4 22B is purpose-tuned on long-form prose and character RP datasets. Better character voice, better long-session coherence, slightly weaker on factual reasoning compared to Dolphin Mixtral. If RP is the primary use case and assistance is secondary, Magnum is the better fit.
For top-tier assistant work in the same VRAM bracket: Mistral Small 24B Instruct is the dense alternative. Native 32K context, very lightly aligned, jailbreaks reliably with a system prompt. Slightly higher quality than Dolphin Mixtral on English reasoning, slightly slower at the same VRAM, similar multilingual coverage.
For the strongest assistant alternative regardless of VRAM: Nous Hermes 3 70B sits in the Dolphin Llama 3 70B bracket — different fine-tuning philosophy, similarly de-aligned, particularly strong on tool use and structured output.
For 8 GB VRAM: L3-Stheno 3.3 8B is the standard answer for tight hardware budgets, particularly for roleplay use cases where Dolphin Llama 3 8B is too dry.
Frequently Asked Questions
Is Dolphin Mixtral really uncensored or just lightly aligned?
Genuinely de-aligned. The DPO step in the Dolphin pipeline trains on preference pairs where direct, helpful responses to refused prompts are the preferred completions. The model does not refuse on the basis of topic, even with a minimal system prompt. It will, however, still hallucinate facts and produce confident nonsense — de-alignment is not the same as truthfulness.
How much VRAM do I need to run Dolphin Mixtral 8x7B?
24 GB is the practical minimum (Q3_K_M or Q4_K_S with offload), 32 GB is comfortable for Q4_K_M with full 32K context, 48 GB lets you run Q5_K_M or higher. CPU-only inference on 32+ GB system RAM works but produces 2-5 tokens per second — fine for batch tasks, painful for chat.
Is Dolphin Mixtral better than the original Mixtral Instruct?
For most uses where refusals are the bottleneck, yes. For benchmark-chasing on heavily-aligned eval suites that reward "I cannot help with that" answers, no. Capability-wise the two are within noise of each other on most reasoning benchmarks; the practical difference is that Dolphin actually answers your question.
What is ChatML and do I need to use it?
ChatML is the prompt format using <|im_start|> and <|im_end|> tokens to delimit system, user, and assistant turns. Yes, you need to use it — Dolphin Mixtral was trained on ChatML and produces noticeably worse output with other formats. Most modern loaders (Ollama, LM Studio, vLLM) handle this automatically from the GGUF metadata.
Should I use Dolphin Mixtral or Dolphin Llama 3 70B?
Dolphin Mixtral if your VRAM budget is 24-32 GB, you need speed, multilingual coverage, or 32K context. Dolphin Llama 3 70B if you have 48 GB+ VRAM and your primary need is peak English reasoning quality. Both are equally de-aligned — the choice is purely about base-model capability and hardware cost.
Is Dolphin Mixtral safe to download?
The official cognitivecomputations weights on Hugging Face are safe — they are model files, not executable code, and SafeTensors format prevents pickle-based attacks. The reputational risk is on the use side, not the download side. Verify checksums on the Hugging Face repo, prefer GGUF quants from established releasers (TheBloke historically, bartowski, mradermacher), and avoid random reuploads from unknown accounts.