
macOS practitioners seek faster llama.cpp tuning for 100k-context inference | UncensoredHub


A practitioner running Qwen3.5-35B at 100k context on macOS asks for optimization shortcuts, highlighting the gap between how thorough llama-bench can be and how long flag tuning takes.

May 18, 2026

A practitioner running Qwen3.5-35B-A3B in GGUF format on macOS reports 1,500 tokens/sec prompt processing and 35–50 tokens/sec generation at 100k context with llama.cpp. The bottleneck isn't the model; it's the time spent tuning flags. This week, they asked for a faster way to find optimal llama.cpp settings without running the full llama-bench suite across every flag combination.

The core problem is familiar to anyone scaling context windows: llama-bench can in principle surface the best configuration, but testing every permutation of flags across multiple models and long context windows takes hours. The user discovered llama-optimus, a third-party optimization tool, but couldn't figure out how to configure it for 100k-context benchmarks, since it appears designed for shorter ranges. The question: is there a way to use llama-bench more selectively, or does llama-optimus support long-context tuning that isn't documented?
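To make the "hours" concrete, here is a rough count of how many benchmark runs a full sweep needs compared with varying one flag at a time from a baseline. It is only a sketch: the flag names follow llama-bench's short options as commonly documented (confirm with `llama-bench --help` on your build), and the candidate values are illustrative assumptions, not the user's settings or recommendations.

```python
from math import prod

# Illustrative search space. The candidate values are assumptions chosen
# only to show how quickly the number of combinations grows.
SPACE = {
    "-b": [512, 1024, 2048, 4096],    # logical batch size
    "-ub": [256, 512, 1024, 2048],    # physical (micro) batch size
    "-fa": [0, 1],                    # flash attention off/on
    "-ctk": ["f16", "q8_0", "q4_0"],  # K cache type
    "-ctv": ["f16", "q8_0", "q4_0"],  # V cache type
    "-t": [4, 8, 12],                 # CPU threads
}


def full_grid_runs(space):
    """Number of runs needed to test every combination of values."""
    return prod(len(values) for values in space.values())


def one_at_a_time_runs(space):
    """One baseline run, plus one run per non-baseline value of each flag."""
    return 1 + sum(len(values) - 1 for values in space.values())


print(full_grid_runs(SPACE))      # 864
print(one_at_a_time_runs(SPACE))  # 14
```

At 100k context each run can take minutes, so the difference between 864 and 14 runs is the difference between an overnight job and a coffee break. The trade-off is that varying one flag at a time can miss interactions between flags.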

What stands out

01 Performance baseline is solid. 1,500 tokens/sec prompt processing and 35–50 tokens/sec generation on macOS with a 35B model at 100k context is respectable for consumer hardware. The user isn't chasing a fix; they're chasing efficiency in the tuning process itself.
02 llama-optimus exists but isn't well-documented for long context. The tool doesn't clearly explain how to test at 100k tokens, and the user couldn't map its config flags to llama-bench's parameter space. That's a documentation gap, not a capability gap: someone who has used it at scale could likely answer in minutes.
03 No consensus workflow for multi-model tuning. The post asks what other practitioners do when they onboard a new model or want to squeeze out the last 10 percent of performance. Whether the community has converged on a standard approach or is still tuning case-by-case remains an open question.
04 The complaint is spending more hours optimizing than running inference. That's a tooling problem: llama.cpp's flag space is large, the interactions between flags aren't always intuitive, and there's no "auto-tune for my hardware" button that works reliably at long context.
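In the absence of an auto-tune button, one generic way to shrink the search is a greedy, one-flag-at-a-time (coordinate) search. The sketch below is not how llama-bench or llama-optimus works; it is a stand-in technique under stated assumptions. The `measure` callback is hypothetical: in practice it would wrap a single llama-bench run and return the reported tokens/sec, while here `synthetic_tps` is a made-up scoring function so the example runs without a model.

```python
def coordinate_search(space, measure, passes=2):
    """Greedy search: vary one flag at a time, keep any value that scores higher.

    space   -- dict mapping flag name to a list of candidate values
    measure -- callable taking a config dict and returning a score (higher is better)
    Returns (best_config, best_score, number_of_distinct_configs_measured).
    """
    config = {flag: values[0] for flag, values in space.items()}
    cache = {}  # avoid re-running a configuration that was already measured

    def score(cfg):
        key = tuple(sorted(cfg.items()))
        if key not in cache:
            cache[key] = measure(cfg)
        return cache[key]

    best = score(config)
    for _ in range(passes):
        improved = False
        for flag, values in space.items():
            for value in values:
                trial = {**config, flag: value}
                trial_score = score(trial)
                if trial_score > best:
                    config, best, improved = trial, trial_score, True
        if not improved:
            break  # a full pass changed nothing, so stop early
    return config, best, len(cache)


def synthetic_tps(cfg):
    """Made-up stand-in for a real benchmark run (assumption, not real data)."""
    tps = 30.0
    tps += {256: 0, 512: 3, 1024: 2}[cfg["-ub"]]
    tps += 5 if cfg["-fa"] == 1 else 0
    tps += 1 if cfg["-ctk"] == "q8_0" else 0
    return tps


# Small illustrative space: 3 * 2 * 2 = 12 combinations in a full grid.
DEMO_SPACE = {"-ub": [256, 512, 1024], "-fa": [0, 1], "-ctk": ["f16", "q8_0"]}

best_config, best_score, runs = coordinate_search(DEMO_SPACE, synthetic_tps)
print(best_config, best_score, runs)
```

On this toy function the search finds the best configuration while measuring fewer than the 12 combinations a full grid would need. Real measurements are noisy, so small gains may not be reproducible, and a greedy search can settle on a local optimum when flags interact strongly.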
