MegaTrain trains 120B-parameter models on a single GPU using CPU RAM
New framework flips the GPU training paradigm by offloading all persistent state to host memory, enabling full-precision training of 120B-parameter models on one consumer card.

MegaTrain, a memory-centric training framework from researchers at George Mason University and Lehigh University, enables full-precision training and fine-tuning of transformer models exceeding 100 billion parameters on a single GPU. The approach reverses the traditional GPU-centric compute model: instead of storing parameters, gradients, and optimizer states in VRAM, MegaTrain moves all persistent state to host CPU RAM and treats the GPU as a stateless compute cache. Data transfer is pipelined with double-buffering, and a stateless template-binding mechanism eliminates the need to hold full model state on the GPU during forward and backward passes.
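The double-buffered pipelining described above can be illustrated with a toy sketch. This is not MegaTrain's code: the names `fetch`, `apply_layer`, and `streamed_forward` are hypothetical, a background thread stands in for an asynchronous host-to-device copy, and each "layer" is just a pair of numbers. The point is only the access pattern, in which all weights live in host memory and at most two layers' worth sit in device-side buffers at any time.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(host_layers, i):
    """Stand-in for a host-to-device copy: returns a fresh copy of layer i's weights."""
    return list(host_layers[i])

def apply_layer(weights, x):
    """Stand-in for GPU compute: a toy affine step, x * w0 + w1."""
    return x * weights[0] + weights[1]

def streamed_forward(host_layers, x):
    """Run the layers in order while holding at most two layers in 'device' buffers."""
    buffers = [None, None]  # two device-side slots (hypothetical)
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(fetch, host_layers, 0)
        for i in range(len(host_layers)):
            slot = i % 2
            buffers[slot] = pending.result()  # wait until layer i has arrived
            if i + 1 < len(host_layers):
                # Begin fetching layer i+1 in the background while layer i computes;
                # it will overwrite the other slot on the next iteration.
                pending = copier.submit(fetch, host_layers, i + 1)
            x = apply_layer(buffers[slot], x)
    return x

# All "persistent state" stays in this host-side list.
host_layers = [(2.0, 1.0), (0.5, -1.0), (3.0, 0.0)]
result = streamed_forward(host_layers, 4.0)
print(result)
```

In a real system the copy and the compute would run on separate GPU streams, and the backward pass would stream gradients back to host memory in the same fashion; the sketch omits both.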
The framework breaks the VRAM capacity ceiling that has historically limited single-node training runs. Because training capacity scales linearly with host memory rather than video memory, practitioners can train models in the 70B–120B+ parameter range on a single workstation. The authors demonstrate full-precision training of a 120-billion-parameter model on one GPU, a task that previously required multi-node clusters. This brings post-training, instruction tuning, and alignment workflows for frontier-scale models within reach of teams without distributed infrastructure, moving resource-intensive fine-tuning onto a single consumer card. The arXiv preprint and open-source code are available now.
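To make the linear-scaling claim concrete, here is a back-of-the-envelope estimate of how much host memory the persistent state would need. The byte counts are assumptions, not figures from the authors: 4-byte (fp32) values and an Adam-style optimizer that keeps two extra values per parameter, for 16 bytes per parameter in total. The helper name `persistent_state_gb` is hypothetical, and activations and transfer buffers are ignored.

```python
def persistent_state_gb(num_params, bytes_per_value=4, optimizer_slots=2):
    """Estimate host memory (decimal GB) for weights, gradients, and optimizer state.

    Assumes one weight, one gradient, and `optimizer_slots` optimizer values per
    parameter, each stored in `bytes_per_value` bytes.
    """
    values_per_param = 1 + 1 + optimizer_slots
    return num_params * values_per_param * bytes_per_value / 1e9

for billions in (70, 120):
    gb = persistent_state_gb(billions * 1_000_000_000)
    print(f"{billions}B parameters: about {gb:.0f} GB of persistent state")
```

Under these assumptions the requirement grows in direct proportion to parameter count, so a budget of this size is a matter of installing enough system RAM rather than enough VRAM.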


