Qwen 4B agent fine-tune scores 10% on Terminal Bench 2, runs free on HuggingFace Spaces
A new agent-tuned Qwen 4B checkpoint runs on HuggingFace Spaces via ZeroGPU, writing small projects and scoring 10 percent on Terminal Bench 2.

A fine-tuned Qwen 4B checkpoint optimized for agent tasks is now live on HuggingFace Spaces, where it runs on ZeroGPU infrastructure and can write small, functional projects. The model scores 10 percent on Terminal Bench 2, a benchmark that measures how well a model, working as an agent, completes tasks in a terminal by issuing commands and writing code. It was tuned on Pi Agent and Hermes Agent datasets, which gives it enough capability to handle iterative coding workflows inside the HuggingFace environment.
The checkpoint uses Pi Agent as its scaffolding framework and deploys via ZeroGPU, HuggingFace's shared GPU service that spins up compute on demand. That setup makes the model accessible without local hardware: users can test it directly in the browser through the HuggingFace Space. One demo shows the model generating a working Tetris game as a web app, an example of the self-contained projects it can produce from a task description.
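For readers unfamiliar with agent scaffolding, the general pattern is a loop in which the model proposes a command, the scaffold runs it, and the output is fed back for the next step. The sketch below illustrates that generic loop only. It is not Pi Agent's implementation, and every name in it (`run_agent`, `scripted_model`, `fake_shell`, the `DONE` marker) is hypothetical; the model and the shell are replaced with stand-ins so the example runs anywhere.

```python
from typing import Callable, List, Tuple

# Hypothetical marker the model emits when it considers the task finished.
DONE = "DONE"

Step = Tuple[str, str]  # (command, observed output)


def run_agent(
    task: str,
    propose: Callable[[str, List[Step]], str],
    execute: Callable[[str], str],
    max_steps: int = 10,
) -> List[Step]:
    """Alternate between proposing a command and observing its output."""
    history: List[Step] = []
    for _ in range(max_steps):
        command = propose(task, history)
        if command.strip() == DONE:
            break
        observation = execute(command)
        history.append((command, observation))
    return history


def scripted_model(task: str, history: List[Step]) -> str:
    """Stand-in for the model: replays a fixed plan, then stops."""
    plan = ["mkdir tetris", "touch tetris/index.html"]
    return plan[len(history)] if len(history) < len(plan) else DONE


def fake_shell(command: str) -> str:
    """Stand-in for a sandboxed shell: echoes the command instead of running it."""
    return f"ok: {command}"


if __name__ == "__main__":
    for command, output in run_agent("build a Tetris web app", scripted_model, fake_shell):
        print(command, "->", output)
```

In a real deployment, `propose` would call the model and `execute` would run the command in an isolated environment; the `max_steps` cap is one simple way to stop a loop that never reaches a finished state.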
Terminal Bench 2 scores typically range from single digits to the low teens for models in the 3B-7B range, so a 10 percent score suggests this fine-tune is viable for simple tasks. The benchmark tests both command accuracy and the model's ability to chain multiple steps in a terminal session, which makes it a stricter evaluation than single-turn code generation. At 4B parameters, the model is small enough to keep inference fast for real-time interaction on shared infrastructure.
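A rough way to see why multi-step evaluation is stricter: if a task counts as passed only when every step succeeds, small per-step error rates compound. The sketch below assumes the headline score is the fraction of tasks passed and that steps succeed independently, which is a simplification. The function names and the numbers are illustrative and are not taken from the benchmark.

```python
from typing import Sequence


def pass_rate(task_results: Sequence[bool]) -> float:
    """Fraction of tasks that passed (assumed definition of the score)."""
    if not task_results:
        raise ValueError("task_results must not be empty")
    return sum(task_results) / len(task_results)


def all_steps_succeed(per_step_accuracy: float, steps: int) -> float:
    """Chance that every step in a task succeeds, assuming independent steps."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # Illustrative only: 1 pass out of 10 tasks gives a 10 percent score.
    print(pass_rate([True] + [False] * 9))
    # Illustrative only: 80 percent per-step accuracy over 1 step vs. 5 steps.
    print(all_steps_succeed(0.8, 1), all_steps_succeed(0.8, 5))
```

Under these assumptions, a model that gets 80 percent of individual steps right would finish a five-step task only about a third of the time, which is one reason multi-step scores sit well below single-turn ones.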

