Open-source developer seeks local vision models for sub-2-second cursor overlay
A developer building an open-source cursor-aware AI overlay is crowdsourcing recommendations for local vision models fast enough to match Gemini 3 Flash while reliably handling function calls.
An open-source developer is asking which vision models can power a cursor-aware AI overlay with sub-2-second time-to-first-token performance. The app, AIPointer, currently routes through cloud providers like OpenRouter and Anthropic, defaulting to Gemini 3 Flash for speed and vision quality. The core interaction is immediate: hold a key, ask a question about the screen region near your cursor, and get an answer before the UX breaks.
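To make that loop concrete, here is a minimal sketch of the hold-key interaction, assuming an OpenAI-compatible `/chat/completions` endpoint (which OpenRouter and most local servers expose). The endpoint URL, capture geometry, and helper names are illustrative, not AIPointer's actual code.

```python
import base64
import json

import mss
import mss.tools
import requests

API_URL = "http://localhost:11434/v1/chat/completions"  # placeholder: a local Ollama server

def grab_region_png(cx: int, cy: int, half: int = 300) -> bytes:
    """Screenshot a square region centered on the cursor at (cx, cy)."""
    box = {"left": cx - half, "top": cy - half, "width": 2 * half, "height": 2 * half}
    with mss.mss() as sct:
        shot = sct.grab(box)
    return mss.tools.to_png(shot.rgb, shot.size)

def ask_about_region(question: str, cx: int, cy: int, model: str) -> None:
    """Send the region screenshot plus the question, then stream the answer."""
    png_b64 = base64.b64encode(grab_region_png(cx, cy)).decode()
    body = {
        "model": model,
        "stream": True,  # streaming is what makes a 2-second first-token budget visible
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{png_b64}"}},
            ],
        }],
    }
    with requests.post(API_URL, json=body, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # OpenAI-style servers stream SSE lines: "data: {json}" / "data: [DONE]"
            if not line.startswith(b"data: ") or line == b"data: [DONE]":
                continue
            delta = json.loads(line[6:])["choices"][0]["delta"]
            if delta.get("content"):
                print(delta["content"], end="", flush=True)
```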
The technical bar is tight. The model must accept vision input alongside text, run fast on consumer hardware (M-series Macs, RTX 3090/4090), and reliably handle six function calls—fetch_url, open_url, copy, save, reveal_folder, read_clipboard. The developer has shortlisted Qwen2.5-VL for vision and tool-use balance, MiniCPM-V for reported speed, Llama 3.2 Vision for potentially stronger tool calling despite slower inference, and Pixtral for vision strength with unclear tool support. The inference stack question is open too: llama.cpp, Ollama, LM Studio, vLLM, or MLX for Mac.
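For a sense of what "reliably handle six function calls" entails, the tools can be sketched as OpenAI-style function schemas, the format that llama.cpp's server, Ollama, vLLM, and LM Studio all accept. Only the tool names come from the post; every parameter shape below is a guess.

```python
def tool(name: str, description: str, props: dict, required: list) -> dict:
    """Wrap a name/description/parameters triple in the OpenAI tool envelope."""
    return {"type": "function", "function": {
        "name": name,
        "description": description,
        "parameters": {"type": "object", "properties": props, "required": required},
    }}

TOOLS = [
    tool("fetch_url", "Fetch a URL and return its contents as text.",
         {"url": {"type": "string"}}, ["url"]),
    tool("open_url", "Open a URL in the default browser.",
         {"url": {"type": "string"}}, ["url"]),
    tool("copy", "Copy text to the system clipboard.",
         {"text": {"type": "string"}}, ["text"]),
    tool("save", "Save text to a file at the given path.",
         {"path": {"type": "string"}, "text": {"type": "string"}},
         ["path", "text"]),
    tool("reveal_folder", "Reveal a folder in the system file browser.",
         {"path": {"type": "string"}}, ["path"]),
    tool("read_clipboard", "Return the current clipboard contents.", {}, []),
]
# Attach with {"tools": TOOLS} in the request body. "Reliable" here means the
# model both picks the right tool and emits arguments that parse as valid JSON.
```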
Practitioners who've shipped vision models with tool calls in production are being asked to share real time-to-first-token numbers, not benchmark claims. If a solid local combo emerges, it will be added as a built-in provider option in AIPointer alongside the cloud routes. The core uncertainty remains whether any open-weight model can actually hit the sub-2-second bar under multi-tool function calling with vision context on mid-range hardware, or whether the local path will force a different UX trade-off: tolerating longer latency for certain query types, or running a smaller model that trades reasoning depth for speed.
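For anyone contributing numbers, one plausible way to capture real TTFT rather than a benchmark claim is to stream the request and stamp the first delta; the harness below is a sketch under that assumption, with the endpoint and request body left as placeholders for whatever stack is under test.

```python
import json
import time

import requests

def measure_ttft(api_url: str, body: dict) -> float:
    """Return seconds from request start to the first streamed delta."""
    body = {**body, "stream": True}
    start = time.perf_counter()
    with requests.post(api_url, json=body, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line.startswith(b"data: ") or line == b"data: [DONE]":
                continue
            delta = json.loads(line[6:])["choices"][0]["delta"]
            # Count a tool-call delta as a first token too, since the
            # sub-2-second bar has to hold under function calling as well.
            if delta.get("content") or delta.get("tool_calls"):
                return time.perf_counter() - start
    raise RuntimeError("stream ended without emitting a token")
```

Measurements are most meaningful against a representative screenshot with the full tool list attached and repeated enough times to see a distribution, since serialized tool schemas lengthen the prompt and push the first token later.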
