MiniMax M2.7 runs 128K context on dual RTX 3090s with CPU offload | UncensoredHub

MiniMax M2.7 runs 128K context on dual RTX 3090s with CPU offload

A practitioner running MiniMax's M2.7 mixture-of-experts (MoE) model at Q8_0 quantization across two RTX 3090s and 256GB of DDR4 reports about 10 tokens per second of generation at 128K context with an unquantized KV cache.

May 18, 2026

A practitioner is running MiniMax M2.7 at Q8_0 quantization on two RTX 3090s with 256GB of DDR4 RAM, pushing the model to 128K context with an unquantized KV cache and generating roughly 10 tokens per second. The setup uses a secondhand Intel 10900X CPU and offloads the mixture-of-experts layers to system memory via llama.cpp flags, trading speed for accuracy in coding agent workflows. Prompt processing runs at around 50 tokens per second.
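To put those two rates in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes that "128K" means 131,072 tokens and that both rates stay constant across the whole context, which the report does not confirm; the 1,000-token reply length is an arbitrary illustration.

```python
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time to process `tokens` at a constant rate (an idealization)."""
    return tokens / tokens_per_second

# Assumption: "128K" is read as 131,072 tokens.
FULL_CONTEXT_TOKENS = 128 * 1024

# Rates taken from the report; treated here as constant, which is a simplification.
PROMPT_TOKENS_PER_SECOND = 50.0
GENERATION_TOKENS_PER_SECOND = 10.0

prefill_seconds = seconds_for(FULL_CONTEXT_TOKENS, PROMPT_TOKENS_PER_SECOND)
reply_seconds = seconds_for(1000, GENERATION_TOKENS_PER_SECOND)  # hypothetical reply length

print(f"Ingesting a full context: about {prefill_seconds / 60:.0f} minutes")
print(f"Generating 1,000 tokens: about {reply_seconds:.0f} seconds")
```

Under these assumptions, ingesting a completely full context from scratch would take well over half an hour, so the setup is likely most comfortable when a prompt grows incrementally rather than being reprocessed in full.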

The user chose Q8_0 quantization after observing instability at lower quantization levels and runs the model with flash attention enabled, 16 CPU threads for both generation and batch processing, and a 4096-token batch size. The --cpu-moe flag keeps the MoE expert weights in system RAM, so the expert layers run on the CPU while attention and the other non-expert tensors stay on the GPUs. MiniMax has not released a draft model for multi-token prediction with the M2.7 release, leaving speculative decoding unavailable.
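The settings described above can be sketched as a llama.cpp launch command. The Python snippet below only assembles the argument list; it is an approximation, not the user's actual command. The model filename is hypothetical, the context size assumes "128K" means 131,072 tokens, and GPU layer count and tensor split options are left out because the report does not state them. Recent llama.cpp builds accept `-fa on`, while older builds use a bare `-fa` flag.

```python
import shlex

def build_llama_server_args(model_path: str,
                            ctx_size: int = 131072,
                            threads: int = 16,
                            batch_size: int = 4096) -> list[str]:
    """Assemble llama-server arguments matching the reported settings (a sketch)."""
    return [
        "llama-server",
        "-m", model_path,          # path to the Q8_0 GGUF file (hypothetical name below)
        "-c", str(ctx_size),       # context window; assumes 128K = 131,072 tokens
        "-t", str(threads),        # CPU threads for generation
        "-tb", str(threads),       # CPU threads for batch / prompt processing
        "-b", str(batch_size),     # logical batch size
        "-fa", "on",               # flash attention (older builds: bare -fa)
        "--cpu-moe",               # keep MoE expert weights in system RAM
        "-ctk", "f16",             # unquantized K cache
        "-ctv", "f16",             # unquantized V cache
    ]

# Hypothetical filename, for illustration only.
args = build_llama_server_args("minimax-m2.7-q8_0.gguf")
print(shlex.join(args))
```

Building the list in code rather than as one long shell string makes it easier to vary a single setting, such as the batch size, when comparing runs.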

The configuration prioritizes long-context coding tasks where generation quality matters more than raw throughput. With 128K context and a full-precision KV cache, the dual-3090 rig stays within VRAM limits by keeping the model quantized at Q8_0 and holding the expert weights in system RAM rather than on the GPUs.
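Why the KV cache dominates the VRAM budget at this context length can be seen with the standard size estimate for a transformer with grouped-query attention. The sketch below assumes 2 bytes per element (f16) and 131,072 tokens of context. The layer count, KV head count, and head dimension are placeholder values chosen for illustration, not confirmed figures for MiniMax M2.7.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size: one K and one V vector per layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# Placeholder architecture values, NOT taken from the M2.7 model card.
PLACEHOLDER_ARCH = {"n_layers": 60, "n_kv_heads": 8, "head_dim": 128}

size_bytes = kv_cache_bytes(n_ctx=128 * 1024, **PLACEHOLDER_ARCH)
print(f"Estimated f16 KV cache: {size_bytes / 2**30:.1f} GiB")
```

With these placeholder values the estimate comes to 30 GiB, a large share of the 48GB of VRAM that two RTX 3090s provide, which illustrates why the expert weights have to live elsewhere when the cache is left unquantized.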

By Alex Sokoloff · AI enthusiast · MSc Computer Science
