InstructSAM segments multiple objects from text in one pass

A 2B-parameter framework injects learnable instance queries into a vision-language model, enabling SAM3 to segment multiple objects from free-form text instructions in a single forward pass.

By Alex Sokoloff · May 26, 2026

InstructSAM is a multi-instance segmentation framework that connects a vision-language model to SAM3 through learnable instance queries. Researchers at Zhejiang University and Alibaba treat instruction-driven segmentation as a set-structured query prediction problem: a bank of instance queries is injected into the VLM, contextualized with both the input instruction and visual features, then projected into SAM3's detector query space. A hybrid-attention mechanism lets queries interact with visual tokens and instruction tokens, reducing duplicate predictions and improving instance enumeration. The design leaves SAM3's core architecture unchanged while adding high-level reasoning and compositional understanding.
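The query path described above can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: it assumes the hybrid attention can be approximated by single-head cross-attention over the concatenated visual and instruction tokens, and that the projection into SAM3's detector query space is a single linear map. All names and dimensions (`attend`, `project`, `D_VLM`, `D_DET`, `NUM_QUERIES`) are hypothetical, and the random vectors stand in for learned parameters and real features.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, tokens):
    """Single-head scaled dot-product cross-attention: each query reads all tokens."""
    scale = math.sqrt(len(tokens[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, t) / scale for t in tokens])
        out.append([sum(w * t[d] for w, t in zip(weights, tokens))
                    for d in range(len(q))])
    return out


def project(vectors, weight):
    """Linear map; `weight` holds one row per output dimension."""
    return [[dot(v, row) for row in weight] for v in vectors]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D_VLM, D_DET, NUM_QUERIES = 8, 4, 5  # hypothetical sizes

instance_queries = random_matrix(NUM_QUERIES, D_VLM, rng)  # learnable in a real model
visual_tokens = random_matrix(16, D_VLM, rng)              # stand-in image features
instruction_tokens = random_matrix(6, D_VLM, rng)          # stand-in text features
to_detector = random_matrix(D_DET, D_VLM, rng)             # assumed linear projection

# Assumption: queries see visual and instruction tokens in one joint context.
context = visual_tokens + instruction_tokens
attended = attend(instance_queries, context)

# Residual update keeps each query's identity while adding context.
contextualized = [[q + a for q, a in zip(q_row, a_row)]
                  for q_row, a_row in zip(instance_queries, attended)]

# One detector-space query per instance slot, all produced in a single pass.
detector_queries = project(contextualized, to_detector)
print(len(detector_queries), len(detector_queries[0]))
```

The point of the sketch is the shape of the interface: a fixed bank of queries goes in, and the same number of detector-space queries comes out, without any per-object prompting loop.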

The architecture addresses a longstanding gap in the Segment Anything ecosystem. SAM and its successors excel at mask prediction when given explicit spatial prompts (clicks, bounding boxes, or reference points) but struggle with natural-language instructions that require compositional reasoning. Prior approaches either fine-tuned SAM's encoder end-to-end, losing its generalization, or chained a separate language model with iterative prompting, adding latency and accumulating errors. InstructSAM's explicit reasoning-to-instance query interface sidesteps both problems. The VLM processes the instruction and image jointly, producing a set of instance-aware queries that SAM3's decoder can consume directly. The result is single-pass multi-instance prediction with no prompt iteration.

The team released Inst2Seg alongside the model, a large-scale dataset pairing free-form instructions with instance-level masks. The benchmark covers complex instruction-driven and phrase-level referring segmentation tasks, scenarios where earlier methods often returned incomplete or duplicate masks. According to the team's experiments, the 2B-scale InstructSAM outperforms prior end-to-end methods and SAM3's agentic pipeline on these benchmarks. The framework's efficiency advantage is clearest in multi-instance queries: asking for "all red cars and pedestrians near the crosswalk" returns a complete mask set in one forward pass, whereas an agentic pipeline would require multiple LLM calls and SAM invocations.
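To make "duplicate masks" concrete, one common way to flag them is pairwise intersection-over-union (IoU) between predicted masks. The sketch below is illustrative only: the article does not say how duplicates are measured, and the function names, the 0.8 threshold, and the toy masks are all assumptions. Masks are represented as sets of (row, column) pixel coordinates.

```python
def mask_iou(a, b):
    """Intersection-over-union of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def find_duplicates(masks, threshold=0.8):
    """Return index pairs whose IoU meets the (assumed) duplicate threshold."""
    pairs = []
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            if mask_iou(masks[i], masks[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Toy predictions: two near-identical "car" masks and one separate "pedestrian".
car = {(r, c) for r in range(4) for c in range(4)}            # 16 pixels
car_again = car - {(3, 3)}                                    # 15 pixels, same object
pedestrian = {(r, c) for r in range(10, 14) for c in range(10, 12)}

duplicates = find_duplicates([car, car_again, pedestrian])
print(duplicates)  # [(0, 1)]
```

A model that enumerates instances well should produce few such pairs, since each query is meant to claim a distinct object.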
