llama.cpp ubatch tuning delivers 5.5× prompt speedup on RTX 3090
A LocalLLaMA user reports that raising llama.cpp's ubatch parameter from 512 to 8192 while offloading more MoE layers to CPU boosted prompt processing from 380 to 2091 tokens per second on a 24 GB card.
Tuning llama.cpp's micro-batch size can deliver dramatic prompt-processing gains at the cost of a small generation slowdown, according to a user running the 120-billion-parameter gpt-oss-120b model on a 24 GB RTX 3090.
The trick centers on the -ub (micro-batch) flag, which controls how many tokens llama.cpp pushes through the model in a single forward pass; the default is 512. Raising it to 8192 enlarges the GPU compute buffers, so the user compensated by keeping the experts of two additional MoE layers on the CPU via --n-cpu-moe 28. With that combination, prompt throughput jumped from 380 tokens per second to 2091 tokens per second, a 5.5× improvement, while token generation dropped about 7 percent, from 32.3 to 30.1 tokens per second.
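A launch command along these lines would exercise the same settings. The GGUF filename, -ngl value, and context size below are illustrative assumptions, not details from the post; -ub 8192 and --n-cpu-moe 28 are the reported values, and the logical batch size (-b) is raised alongside because llama.cpp clamps the micro-batch to it:

```bash
# Sketch of a llama-server launch with the reported tuning.
# Assumptions: the model filename, -ngl 99, and the 16k context are illustrative.
#   -ub 8192        micro-batch size: tokens processed per forward pass during prefill
#   -b 8192         logical batch size; must be at least as large as -ub
#   --n-cpu-moe 28  keep the experts of the first 28 MoE layers on the CPU
./llama-server -m gpt-oss-120b-Q4.gguf -ngl 99 \
  --n-cpu-moe 28 -b 8192 -ub 8192 -c 16384
```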
| ubatch | n-cpu-moe | prefill (tok/s) | generation (tok/s) |
|---|---|---|---|
| 256 | 25 | 240 | 33.1 |
| 512 | 26 | 380 | 32.3 |
| 2048 | 25 | 1113 | 33.0 |
| 4096 | 26 | | |
| 8192 | 28 | 2091 | 30.1 |
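The prefill and generation columns correspond to the pp and tg throughput figures llama-bench reports. A sweep along these lines would reproduce the shape of the table, though the model path, prompt length, and generation length here are assumptions, and the MoE-offload setting would need to be adjusted per run as in the post:

```bash
# Sketch of a llama-bench sweep over micro-batch sizes (paths and sizes assumed).
# The pp (prompt processing) and tg (token generation) results map to the
# prefill and generation columns in the table above.
./llama-bench -m gpt-oss-120b-Q4.gguf -ngl 99 \
  -b 8192 -ub 256,512,2048,4096,8192 \
  -p 4096 -n 128
```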
