llama.cpp ubatch tuning delivers 5.5× prompt speedup on RTX 3090
A LocalLLaMA user reports that raising llama.cpp's ubatch parameter from 512 to 8192 while offloading more MoE layers to CPU boosted prompt processing from 380 to 2091 tokens per second on a 24 GB card.
Tuning llama.cpp's micro-batch size can deliver dramatic prompt-processing gains at the cost of a small generation slowdown, according to a user running the 120-billion-parameter gpt-oss-120b model on a 24 GB RTX 3090.
The trick centers on the -ub (micro-batch) flag, which controls how many tokens llama.cpp pushes through the model in a single forward pass; the default is 512. Raising it to 8192 enlarges the GPU compute buffers, so the user compensated by keeping the experts of two additional MoE layers on the CPU via --n-cpu-moe 28. With that combination, prompt throughput jumped from 380 tokens per second to 2091 tokens per second, a 5.5× improvement, while token generation dropped about 7 percent, from 32.3 to 30.1 tokens per second.
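A launch command along these lines would exercise the same settings. The GGUF filename, -ngl value, and context size below are illustrative assumptions, not details from the post; -ub 8192 and --n-cpu-moe 28 are the reported values, and the logical batch size (-b) is raised alongside because llama.cpp clamps the micro-batch to it:

```bash
# Sketch of a llama-server launch with the reported tuning.
# Assumptions: the model filename, -ngl 99, and the 16k context are illustrative.
#   -ub 8192        micro-batch size: tokens processed per forward pass during prefill
#   -b 8192         logical batch size; must be at least as large as -ub
#   --n-cpu-moe 28  keep the experts of the first 28 MoE layers on the CPU
./llama-server -m gpt-oss-120b-Q4.gguf -ngl 99 \
  --n-cpu-moe 28 -b 8192 -ub 8192 -c 16384
```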
| ubatch | n-cpu-moe | prefill (tok/s) | generation (tok/s) |
|---|---|---|---|
| 256 | 25 | 240 | 33.1 |
| 512 | 26 | 380 | 32.3 |
| 2048 | 25 | 1113 | 33.0 |
| 4096 | 26 | | |
| 8192 | 28 | 2091 | 30.1 |
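The prefill and generation columns correspond to the pp and tg throughput figures llama-bench reports. A sweep along these lines would reproduce the shape of the table, though the model path, prompt length, and generation length here are assumptions, and the MoE-offload setting would need to be adjusted per run as in the post:

```bash
# Sketch of a llama-bench sweep over micro-batch sizes (paths and sizes assumed).
# The pp (prompt processing) and tg (token generation) results map to the
# prefill and generation columns in the table above.
./llama-bench -m gpt-oss-120b-Q4.gguf -ngl 99 \
  -b 8192 -ub 256,512,2048,4096,8192 \
  -p 4096 -n 128
```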
