SmallCode hits 87% on coding benchmarks using Gemma 4B with compound-tool design

A developer has released SmallCode, a terminal-based coding agent that completes 87% of benchmark tasks using Gemma 4B by bundling tool calls, auto-fixing errors, and managing token budgets, outperforming agents that run 14B models without scaling up the model itself.

May 16, 2026

A developer frustrated with coding agents built for frontier models has released SmallCode, a terminal-based coding agent that completes 87 of 100 benchmark tasks using Gemma 4B, a 4-billion-parameter model, and outscores OpenCode running 14B models by roughly 12 percentage points. The difference isn't model size; it's harness design.

SmallCode targets the core failure mode of small-model agents: sequential tool calls. Instead of asking a 4B model to chain four separate operations (find file, read file, edit file, verify), SmallCode bundles all four into a single compound tool. This change alone cuts failure rates in half. An improvement loop then compiles and lints the generated code immediately, feeding any errors back to the model for a second pass. Small models don't need to get it right the first time if they can fix mistakes when shown them.
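To make the compound-tool idea concrete, here is a minimal Python sketch of a single call that finds, reads, edits, and verifies a file. It is not SmallCode's code: the names `compound_edit` and `EditResult` are hypothetical, and a plain syntax check with `ast.parse` stands in for whatever compile-and-lint step the agent actually runs.

```python
import ast
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class EditResult:
    ok: bool
    path: Optional[str]
    message: str  # when ok is False, this is what gets shown back to the model


def compound_edit(root: Path, filename: str, old: str, new: str) -> EditResult:
    """Find, read, edit, and verify a file in one tool call (hypothetical API)."""
    # 1. Find: locate the file by name anywhere under the project root.
    matches = sorted(root.rglob(filename))
    if not matches:
        return EditResult(False, None, f"no file named {filename!r} under {root}")
    path = matches[0]

    # 2. Read: load the current contents and check the target text exists.
    source = path.read_text()
    if old not in source:
        return EditResult(False, str(path), f"text to replace not found in {path.name}")

    # 3. Edit: apply the replacement in memory, leaving the file untouched for now.
    updated = source.replace(old, new, 1)

    # 4. Verify: a syntax check stands in for a real compile-and-lint step.
    try:
        ast.parse(updated)
    except SyntaxError as err:
        return EditResult(False, str(path), f"syntax error at line {err.lineno}: {err.msg}")

    # Only write once verification has passed.
    path.write_text(updated)
    return EditResult(True, str(path), "edit applied and verified")
```

The model makes one decision (what to change) instead of four, and a failed verification comes back as a message it can act on in a second pass rather than as a broken file on disk.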

If a task fails twice, SmallCode decomposes the problem into smaller pieces, so that a 200-line file edit becomes a single-line fix. If decomposition still fails and the user has configured a Claude or OpenAI API key, the agent escalates that one task to the larger model, keeping 95% of work local. Token budgeting keeps the model from ever hitting mid-context truncation: SmallCode summarizes and truncates aggressively, staying inside the 32k–256k context windows typical of local models. A code-graph index replaces grep, walking function and class relationships to return only the relevant connected code when a user asks how a subsystem works.
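The fallback sequence described above (retry locally, decompose, then escalate) can be sketched as a small ladder. This is an illustration under assumptions, not SmallCode's implementation: `run_with_fallbacks`, the `Solver` type, and the `decompose` callback are hypothetical names, and the article does not say how the agent actually splits a task.

```python
from typing import Callable, Optional

# A solver takes a task description and returns a result, or None on failure.
Solver = Callable[[str], Optional[str]]


def run_with_fallbacks(
    task: str,
    local: Solver,
    decompose: Callable[[str], list[str]],
    remote: Optional[Solver] = None,
    max_attempts: int = 2,
) -> tuple[str, Optional[str]]:
    """Return (stage, result), where stage names the rung that succeeded."""
    # Rung 1: let the local model try the whole task, up to max_attempts times.
    for _ in range(max_attempts):
        result = local(task)
        if result is not None:
            return "local", result

    # Rung 2: split the task and solve every piece with the local model.
    pieces = decompose(task)
    partials = [local(piece) for piece in pieces]
    if pieces and all(p is not None for p in partials):
        return "decomposed", "\n".join(partials)

    # Rung 3: escalate this one task, only if a remote model was configured.
    if remote is not None:
        result = remote(task)
        if result is not None:
            return "escalated", result

    return "failed", None
```

Leaving `remote` as `None` models a user who has not configured an API key: the ladder simply stops at decomposition and reports failure rather than sending anything off the machine.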

The agent ships as a full-screen terminal UI with scrollable chat, a command palette, and a plugin system. It does not yet integrate language servers, support multi-session workflows, or ship a desktop app. The developer positions it as a tool for practitioners running local models, not as a competitor to Claude Code or Cursor for users with frontier API access.
