User reports Qwen models degrading after two weeks on llama.cpp, cause unclear
A user reports Qwen models degrading noticeably after running on llama.cpp for two weeks, raising questions about inference server stability during extended uptime.
A user running what they describe as Qwen 3.6 models on llama.cpp for roughly two weeks reports that the models have become "considerably dumber" than when they were first loaded, even in freshly started sessions. The claim raises the question of whether prolonged server uptime can actually degrade inference quality, or whether misconfiguration is to blame.
Llama.cpp, the popular C++ inference engine for running open-weight models locally, is designed to be stateless between requests. Model weights loaded into memory shouldn't degrade under normal operation: the engine loads quantized weights once at startup and then serves inference requests without ever modifying them. Each request is meant to be independent, with sampling state cleared between completions.

That leaves more mundane explanations: context window overflow if the server is carrying long-running conversations without proper truncation, temperature drift from uncleared sampling state in certain API modes, or external factors such as system memory pressure causing swap thrashing or, more rarely, silent corruption.
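For anyone in a similar position, one low-effort way to turn a vague impression into evidence is a determinism probe: send the running server the same prompt with temperature 0 and a fixed seed every day, and diff the outputs. The sketch below is illustrative rather than taken from the report; it assumes llama.cpp's bundled llama-server is running locally with its OpenAI-compatible API on port 8080, and the prompt is arbitrary.

```python
#!/usr/bin/env python3
"""Determinism probe for a long-running llama-server instance.

Sends a fixed prompt with temperature 0 and a fixed seed, then prints a
hash of the completion. Run it periodically and compare hashes: if the
server is stateless and the weights on disk haven't changed, the output
should stay the same. Endpoint, port, and prompt are assumptions.
"""
import hashlib
import json
import urllib.request

ENDPOINT = "http://127.0.0.1:8080/v1/chat/completions"  # llama-server's OpenAI-compatible route
PROMPT = "List the first five prime numbers, separated by commas."


def probe() -> str:
    payload = {
        "model": "local",  # placeholder; llama-server serves whichever model it was started with
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 64,
        "temperature": 0.0,  # greedy decoding
        "seed": 42,          # fixed seed for reproducibility
    }
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    text = probe()
    print(hashlib.sha256(text.encode("utf-8")).hexdigest())  # easy to compare across days
    print(text)
```

If the hash changes between runs against an unchanged model file, that's the kind of reproducible artifact a useful bug report could be built around.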
The model names themselves are puzzling. Qwen is a series of open-weight models from Alibaba Cloud, with official releases including Qwen 2.5 in late 2024 and Qwen 3 in early 2025, neither of which matches the version the user cites. The official Qwen 3 lineup ranges from sub-1B dense models up to 32B, plus larger mixture-of-experts releases, but includes no 27B or 35B variants. The user may be running community fine-tunes, intermediate checkpoints, or mislabeled quantized variants circulating on model-sharing sites; without a link to the specific model cards, it's unclear what weights are actually in use. No reproducible bug report has been filed against the llama.cpp project, and the maintainers have not commented on the claim.
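Even without knowing the exact model cards, a user can at least verify that the weights on disk are the weights they think they downloaded, which also rules out silent file corruption. A minimal sketch, assuming the uploader published a checksum on the model card; the file path is a placeholder:

```python
#!/usr/bin/env python3
"""Hash a GGUF file so it can be compared against the checksum published
on the model card (when one is available). A mismatch would point to a
corrupted or mislabeled download rather than anything llama.cpp does at
runtime. The default path is a placeholder.
"""
import hashlib
import sys


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so multi-gigabyte GGUFs don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "model.gguf"
    print(f"{sha256_of(path)}  {path}")
```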
