Snapdragon 8 Elite runs 35B MoE models at 11–25 tokens/sec on Android | UncensoredHub

A user running Qwen and Gemma MoE models on an Honor Magic 7 Pro reports that CPU inference is still faster than the NPU and GPU, with the IQ4_XS and MXFP4_MOE quants delivering the best quality-to-size ratios.

May 16, 2026

The Snapdragon 8 Elite chipset can run mixture-of-experts (MoE) models of up to 35 billion parameters on Android phones with 24GB of RAM, according to hands-on testing shared this week. A user running Qwen 3.5-35B-A3B, Gemma-4-A4B-26B, and other MoE models on an Honor Magic 7 Pro reports generation speeds between 11 and 25 tokens per second. CPU inference currently outpaces both the Hexagon NPU and the OpenCL GPU backend, although it produces more heat.

The 24GB ceiling matters because no Android phones ship with 32GB of physical RAM, and virtual RAM extensions don't work for local LLM inference. That constraint pushes users toward MoE architectures, which keep a large total parameter count but activate only a few billion parameters per token. The tester settled on two quant formats: MXFP4_MOE for maximum speed at good quality, and IQ4_XS for the best quality per gigabyte when running other apps alongside the model. Q4_0 runs faster but delivers noticeably worse output; Q4_K_M splits the difference.
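
To see why the 24GB ceiling bites, it helps to estimate how much RAM the weights alone occupy at each quant. The sketch below is a back-of-the-envelope calculation, not the tester's method. The bits-per-weight values are nominal approximations (real GGUF files mix tensor types, so actual file sizes differ), and the names `weight_gib` and `headroom_gib` and the 6 GiB reserved for Android and other apps are illustrative assumptions. KV cache and runtime buffers are ignored.

```python
# Nominal bits per weight for each quant format. These are approximations:
# real GGUF files store some tensors at higher precision.
BITS_PER_WEIGHT = {
    "Q4_0": 4.5,     # 32 four-bit weights plus one 16-bit scale per block
    "MXFP4": 4.25,   # 32 four-bit weights plus one 8-bit scale per block
    "IQ4_XS": 4.25,  # nominal figure
    "Q4_K_M": 4.85,  # rough average, varies by model
}


def weight_gib(total_params_billion: float, quant: str) -> float:
    """Approximate RAM needed for the weights alone, in GiB."""
    total_bits = total_params_billion * 1e9 * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 2**30


def headroom_gib(
    total_params_billion: float,
    quant: str,
    ram_gib: float = 24.0,      # phone RAM, treated as GiB for simplicity
    reserved_gib: float = 6.0,  # hypothetical share for the OS and other apps
) -> float:
    """RAM left after weights and the reserved share (negative = does not fit)."""
    return ram_gib - reserved_gib - weight_gib(total_params_billion, quant)


if __name__ == "__main__":
    for params in (24, 26, 30, 35):
        for quant in BITS_PER_WEIGHT:
            print(
                f"{params}B {quant:7s} "
                f"weights ~{weight_gib(params, quant):5.1f} GiB, "
                f"headroom ~{headroom_gib(params, quant):5.1f} GiB"
            )
```

Under these assumptions, a 35B model at 4.25 bits per weight needs a little over 17 GiB for weights alone, which is why a smaller model or a tighter quant matters once other apps are running.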

What stands out

1. LFM-24B-A2B is the speed leader. The smallest model tested, a 24-billion-parameter MoE with only 2 billion active parameters, hits 25 tokens per second generation and 60 tokens per second prompt processing. The tester calls it "incredibly smart for its size" and wants more A2B MoE models.
2. Qwen 3.5-35B-A3B and Gemma-4-A4B-26B trade speed for reasoning. Qwen 3.5 (preferred over 3.6 by the tester) and Gemma sit at the other end of the curve: 11–12 tokens per second generation, 40 tokens per second prompt processing, but noticeably better reasoning.
3. Qwen 3-30B-A3B-2507 frees up RAM for other apps. The smaller Qwen variant leaves headroom when the OS and background services eat into the 24GB budget, a real concern for multitasking.
4. GPT-OSS-20B is over-censored. The tester explicitly warns against it: refusals trigger too easily, even on borderline queries.
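
To put the reported speeds in practical terms, the sketch below converts tokens per second into a rough wait time for a single reply. It is an illustration only: the `Throughput` class and `response_seconds` function are hypothetical names, the 600-token prompt and 250-token reply are assumed workload sizes, and the Qwen entry uses the lower end of the reported 11–12 tokens per second range. It also assumes speeds stay constant, which thermal throttling on a phone may not allow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Throughput:
    prompt_tps: float  # prompt processing speed, tokens per second
    gen_tps: float     # generation speed, tokens per second


# Speeds as reported by the tester on the Honor Magic 7 Pro.
REPORTED = {
    "LFM-24B-A2B": Throughput(prompt_tps=60, gen_tps=25),
    "Qwen 3.5-35B-A3B": Throughput(prompt_tps=40, gen_tps=11),
}


def response_seconds(speed: Throughput, prompt_tokens: int, output_tokens: int) -> float:
    """Time to read the prompt plus time to generate the reply, in seconds."""
    return prompt_tokens / speed.prompt_tps + output_tokens / speed.gen_tps


if __name__ == "__main__":
    # Hypothetical workload: a 600-token prompt and a 250-token reply.
    for name, speed in REPORTED.items():
        seconds = response_seconds(speed, prompt_tokens=600, output_tokens=250)
        print(f"{name}: ~{seconds:.0f} s")
```

For this assumed workload the faster model finishes in roughly half the time, so the choice between the two comes down to whether the better reasoning is worth the longer wait.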
