TinyHarness cuts memory overhead for local model orchestration
A new open-source AI harness written in a compiled language targets minimal memory overhead while supporting multiple local inference backends and web search.

TinyHarness is a local-first AI orchestration tool that prioritizes memory efficiency by avoiding JavaScript, TypeScript, and Python runtimes. The project supports Ollama, llama.cpp, and vllm backends and can route web queries through Ollama's web search API. By compiling to native code instead of running an interpreter, the harness leaves more VRAM available for the model itself — a practical constraint for users running inference on consumer hardware where every gigabyte counts.
The memory argument sharpens as models grow. A 70B parameter model quantized to 4-bit precision consumes roughly 40GB of VRAM; adding a Node.js or Python interpreter on top can claim another 500MB to 2GB depending on dependencies. The multi-backend design reflects the fragmented state of local inference tooling: Ollama dominates ease-of-use, llama.cpp remains the go-to for performance tuning and exotic quantization, and vllm serves high-throughput batched requests. A harness that speaks to all three without forcing a rewrite has practical appeal for developers prototyping agent systems or RAG pipelines that need to swap backends mid-project. The project is available on GitHub and remains an early-stage prototype.