Developer trains 7B mixture-of-experts model on single RTX 6000 Blackwell
A solo developer is pretraining a 7-billion-parameter mixture-of-experts language model with 64 experts on one RTX 6000 Blackwell, using memory optimizations that keep VRAM usage at 80GB while maintaining full GPU throughput.

The project follows DeepSeek's mixture-of-experts architecture and trains on the DOLMA and RedPajama datasets in bfloat16 precision. To hold the 7-billion-parameter, 64-expert model at 80GB of VRAM on the single RTX 6000 Blackwell without sacrificing throughput, the developer combines the GUM and Muon optimizers to reduce memory overhead. Reducing the expert count below 64 would substantially lower VRAM requirements for anyone replicating the setup.
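As a rough illustration of why the setup can fit, the back-of-envelope arithmetic below uses only the reported figures (7B parameters, bfloat16, an 80GB budget); the single bf16 momentum buffer per parameter is a guess at how a Muon-style optimizer might be accounted for, not the developer's actual breakdown.

```python
# Back-of-envelope VRAM estimate for the reported setup. Only the 7B
# parameter count, bfloat16 precision, and 80GB budget come from the
# article; the single bf16 momentum buffer per parameter (Muon-style)
# is an illustrative assumption.
PARAMS = 7e9
BYTES_BF16 = 2

weights = PARAMS * BYTES_BF16    # ~14 GB of model weights
grads = PARAMS * BYTES_BF16      # ~14 GB of gradients
momentum = PARAMS * BYTES_BF16   # ~14 GB if one bf16 momentum buffer (assumed)

static_gb = (weights + grads + momentum) / 1e9
print(f"static memory ~ {static_gb:.0f} GB")                              # ~42 GB
print(f"headroom under 80 GB ~ {80 - static_gb:.0f} GB for activations")  # ~38 GB
```

Under those assumptions, weights, gradients, and optimizer state land around 42GB, leaving the rest of the 80GB for activations and routing buffers.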
The training run follows Chinchilla scaling laws and aims to demonstrate that open-source development can match closed models at trillion-parameter scale. The developer plans to release a database of trained checkpoints that others can fine-tune for specific domains—math, literature, physics—and deploy as specialized agents. The model is configured to support RLHF via PPO and GRPO, though those stages haven't started yet.
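For context, the Chinchilla heuristic of roughly 20 training tokens per parameter (Hoffmann et al., 2022) implies a token budget like the one sketched below; whether an MoE model should count total or only active parameters is unsettled, and the 2B active-parameter figure is an assumption, not a reported number.

```python
# Chinchilla heuristic: roughly 20 training tokens per parameter
# (Hoffmann et al., 2022). For an MoE model it is unclear whether total
# or active parameters should be counted, so both are shown; the 2B
# active-parameter figure is an assumption, not a reported number.
TOKENS_PER_PARAM = 20

total_params = 7e9    # reported total parameter count
active_params = 2e9   # assumed active parameters per token

print(f"tokens if counting total params:  {TOKENS_PER_PARAM * total_params / 1e9:.0f}B")  # 140B
print(f"tokens if counting active params: {TOKENS_PER_PARAM * active_params / 1e9:.0f}B")  # 40B
```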
Early outputs
At 15,000 training steps, the model correctly identifies Paris as the capital of France but hallucinates Hokkaido as Japan's capital instead of Tokyo. Code-generation prompts return garbled output—repeated quote marks for a Fibonacci function, nonsensical comment blocks for a PyTorch Transformer class. The developer acknowledges the model is early in pretraining and expects coherence to improve as training continues. No release date or model card has been posted yet.
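For readers wanting to reproduce this kind of spot check, the sketch below shows how an early checkpoint could be probed with similar prompts, assuming a Hugging Face-compatible checkpoint; the path and generation settings are hypothetical, since no checkpoint has been released.

```python
# Minimal sketch for probing an early checkpoint with the prompts described
# above, assuming a Hugging Face-compatible checkpoint. The local path
# "./moe-7b-step15000" is hypothetical; nothing has been released yet.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "./moe-7b-step15000"  # hypothetical path to a step-15,000 checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="auto"
)

prompts = [
    "The capital of France is",
    "The capital of Japan is",
    "def fibonacci(n):",
]
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```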