Qwen 4B agentic fine-tune scores 10% on Terminal Bench 2, runs free on HuggingFace ZeroGPU

A new 4B-parameter Qwen fine-tune trained for agentic tasks runs on HuggingFace ZeroGPU and scores 10 percent on Terminal Bench 2.

By Alex Sokoloff · May 28, 2026

A 4-billion-parameter Qwen fine-tune optimized for agentic workflows now runs on HuggingFace's ZeroGPU infrastructure and handles small coding projects. The model, available as a Space on HuggingFace, is tuned for Pi Agent and Hermes Agent task formats and represents a practical middle ground between capability and accessibility for developers working with constrained compute.

The checkpoint is designed to execute terminal commands and write short scripts within an agentic loop. Its compact 4B parameter count lets it run on shared GPU quota without dedicated hardware, making it accessible to practitioners who lack local inference setups or API budgets. ZeroGPU, HuggingFace's shared GPU service for Spaces, allocates compute dynamically, so the model spins up on demand rather than requiring a persistent instance.
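For readers unfamiliar with the pattern, an agentic loop of this kind can be sketched in a few lines of Python. This is an illustrative sketch only, not the Space's actual code: `run_agent`, `Policy`, and `scripted_policy` are hypothetical names, and the scripted policy stands in for the model, which in a real setup would generate each command from the task and the transcript so far.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: a policy sees the task and the (command, output)
# history, and returns the next shell command, or None when it is finished.
History = List[Tuple[str, str]]
Policy = Callable[[str, History], Optional[str]]


def run_agent(task: str, policy: Policy, max_steps: int = 5,
              timeout: float = 10.0) -> History:
    """Alternate between asking the policy for a command and running it."""
    history: History = []
    for _ in range(max_steps):
        command = policy(task, history)
        if command is None:  # the policy considers the task complete
            break
        try:
            result = subprocess.run(
                command, shell=True, capture_output=True,
                text=True, timeout=timeout,
            )
            output = result.stdout + result.stderr
        except subprocess.TimeoutExpired:
            output = "[timed out]"
        # Feed the output back so the next step can react to errors.
        history.append((command, output))
    return history


def scripted_policy(task: str, history: History) -> Optional[str]:
    """Stand-in for the model: replays a fixed list of commands."""
    script = ["echo hello"]
    return script[len(history)] if len(history) < len(script) else None


if __name__ == "__main__":
    for command, output in run_agent("say hello", scripted_policy):
        print(f"$ {command}\n{output}", end="")
```

The step cap and per-command timeout are design choices for the sketch; running model-generated commands with `shell=True` is only reasonable inside a sandbox.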

On Terminal Bench 2, a benchmark that measures command-line task completion, the fine-tune scores 10 percent, meaning it completes roughly one in ten tasks end-to-end without human intervention. That figure is modest compared to frontier models, but it reflects real progress for a model this size running on free infrastructure. Terminal Bench 2 evaluates whether a model can parse natural-language instructions, generate valid shell commands, and handle multi-step workflows that involve file manipulation, environment setup, and error recovery.
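The arithmetic behind a completion score like this is simple: tasks solved divided by tasks attempted. The snippet below is a sketch with made-up outcomes, since the article does not give the benchmark's task count; `pass_rate` is a hypothetical helper, not part of any benchmark harness.

```python
from typing import Sequence


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of tasks completed end-to-end (True means solved)."""
    if not outcomes:
        raise ValueError("need at least one task outcome")
    return sum(outcomes) / len(outcomes)


# Illustrative data only: one solved task out of ten attempts.
example = [True] + [False] * 9
print(f"{pass_rate(example):.0%}")  # prints "10%"
```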

The Pi Agent and Hermes Agent tuning targets are both open frameworks for building tool-using language models. Pi Agent emphasizes structured reasoning and multi-turn task decomposition, while Hermes Agent focuses on function-calling and API integration. By training on both formats, the fine-tune can switch between reasoning styles depending on the prompt structure. Developers can test it now by submitting natural-language task descriptions and watching the model generate and execute commands in a sandboxed terminal environment.
