Nemotron-3-Super 64B hits 500k context on dual TITAN RTX at 21 tok/s
A math-focused Nemotron fine-tune delivers half-million-token context windows on 48GB VRAM, with one practitioner reporting strong agentic coding performance despite the specialized training.
A practitioner running dual TITAN RTX cards reports hitting 500,000-token context windows at 21 tokens per second using Nemotron-3-Super-64B-A12B-Math-REAP, a math-focused fine-tune available in GGUF quantized format on HuggingFace. Over a week of daily use, the model has performed reliably for agentic coding projects—a use case outside its stated math training objective.
Nemotron-3-Super-64B-A12B-Math-REAP is a 64-billion-parameter mixture-of-experts model with 12 billion active parameters per forward pass. The base Nemotron-Super architecture supports extreme context lengths through sparse attention patterns; this REAP variant layers additional math-domain fine-tuning on top. The quantized GGUF release, credited to Max-and-Omnis, appears to be a community effort rather than an official NVIDIA release.
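The split between total and active parameters is what makes the setup plausible: all 64 billion weights must sit in memory, but each token only reads the 12-billion-parameter active subset. A rough back-of-envelope sketch in Python illustrates the footprint; the bits-per-weight figures are approximate averages for common GGUF quant types, not measurements from this particular release:

```python
# Rough footprint estimates for a 64B-total / 12B-active MoE model.
# Bits-per-weight values are approximate averages for llama.cpp quant types,
# not measured from the Math-REAP GGUF itself.
TOTAL_PARAMS = 64e9
ACTIVE_PARAMS = 12e9

for quant, bpw in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5)]:
    weights_gib = TOTAL_PARAMS * bpw / 8 / 2**30  # resident weight footprint
    active_gib = ACTIVE_PARAMS * bpw / 8 / 2**30  # weights actually read per token
    print(f"{quant}: ~{weights_gib:.0f} GiB weights, ~{active_gib:.1f} GiB touched per token")

# Q4_K_M: ~36 GiB weights, ~6.7 GiB touched per token
# Q5_K_M: ~42 GiB weights, ~8.0 GiB touched per token
# Q8_0:   ~63 GiB weights, ~11.9 GiB touched per token
```

By this estimate, only the roughly Q4-to-Q5 range fits in 2x24GB at all, and even then the 500k-token KV cache has to squeeze into whatever VRAM remains or spill to system RAM.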
Fitting the weights plus a 500k-token KV cache into 48GB implies aggressive quantization, likely Q4 or Q5 GGUF, though the exact level wasn't specified. Running a model of this scale on two 24GB consumer GPUs from NVIDIA's pre-Ampere generation is unconventional, yet the user reports the model "holds up like a champ" for code generation. The claim remains under-documented; practitioners interested in reproducing the result should test and report back on where the model breaks and what other workflows it handles.
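For anyone attempting a reproduction, a minimal llama-cpp-python loading sketch along these lines is a reasonable starting point. The filename, quant level, and tensor split below are assumptions for illustration, not details from the original report:

```python
from llama_cpp import Llama

# Hypothetical filename and settings; the original poster did not share an exact config.
llm = Llama(
    model_path="nemotron-3-super-64b-a12b-math-reap-Q4_K_M.gguf",
    n_ctx=500_000,            # the reported half-million-token window
    n_gpu_layers=-1,          # offload every layer to GPU
    tensor_split=[0.5, 0.5],  # spread weights evenly across the two TITAN RTX cards
    flash_attn=True,          # flash attention, if the build and GPU support it
)

out = llm("Summarize the repository below and propose a refactor plan:\n...", max_tokens=512)
print(out["choices"][0]["text"])
```

At that context length the KV cache is likely the binding constraint, so starting with a smaller n_ctx and working upward, or using llama.cpp's cache-quantization options, may be necessary in practice.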
