InstructSAM segments multiple objects from text in one pass
A 2B-parameter framework injects learnable instance queries into a vision-language model, enabling SAM3 to segment multiple objects from free-form text instructions in a single forward pass.
InstructSAM is a multi-instance segmentation framework that connects a vision-language model (VLM) to SAM3 through learnable instance queries. Researchers at Zhejiang University and Alibaba treat instruction-driven segmentation as a set-structured query prediction problem: a bank of instance queries is injected into the VLM, contextualized with both the input instruction and the visual features, then projected into SAM3's detector query space. A hybrid-attention mechanism lets the queries interact with visual tokens and instruction tokens, which reduces duplicate predictions and improves instance enumeration. The design leaves SAM3's core architecture unchanged while adding high-level reasoning and compositional understanding.
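One way to picture the hybrid attention described above is as a boolean mask over a single token sequence. The sketch below is illustrative only and rests on assumptions the article does not confirm: the token layout (visual tokens, then instruction tokens, then instance queries), the function name `build_hybrid_mask`, and the choice to keep causal attention for context tokens while letting each query see all context tokens and every other query.

```python
def build_hybrid_mask(num_visual, num_text, num_queries):
    """Return mask[i][j] == True when token i may attend to token j.

    Assumed (hypothetical) layout: [visual | instruction | instance queries].
    """
    num_context = num_visual + num_text
    total = num_context + num_queries
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_context:
                # Context tokens keep a standard causal pattern:
                # each one sees itself and earlier tokens only.
                mask[i][j] = j <= i
            else:
                # Instance queries see every context token and every
                # other query (bidirectional among queries).
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    # Toy sizes chosen only to make the pattern easy to read.
    for row in build_hybrid_mask(num_visual=2, num_text=2, num_queries=3):
        print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions, the query-to-query block is what would give each query a view of what the others are attending to, which is one plausible route to fewer duplicate predictions.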
The architecture addresses a longstanding gap in the Segment Anything ecosystem. SAM and its successors excel at mask prediction when given explicit spatial prompts (clicks, bounding boxes, or reference points) but struggle with natural-language instructions that require compositional reasoning. Prior approaches either fine-tuned SAM's encoder end-to-end, sacrificing its generalization, or chained a separate language model to SAM through iterative prompting, adding latency and compounding errors. InstructSAM's explicit reasoning-to-instance query interface sidesteps both problems. The VLM processes the instruction and image jointly and produces a set of instance-aware queries that SAM3's decoder can consume directly. The result is single-pass multi-instance prediction with no prompt iteration.
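The query interface can be thought of as a learned projection from the VLM's hidden size to the width the segmentation decoder expects, with one output vector per instance query. The following is a minimal sketch under stated assumptions: a single linear layer stands in for whatever projection the authors actually use, and `project_queries`, the toy dimensions, and the weights are all hypothetical.

```python
def project_queries(query_states, weight, bias):
    """Apply a linear layer to each instance query.

    query_states: list of vectors, one per instance query (VLM hidden size).
    weight: matrix with one row per output dimension (decoder query size).
    bias: vector with one entry per output dimension.
    """
    projected = []
    for q in query_states:
        # Dot each weight row with the query state, then add the bias term.
        out = [sum(w * x for w, x in zip(row, q)) + b
               for row, b in zip(weight, bias)]
        projected.append(out)
    return projected


if __name__ == "__main__":
    # Two queries with a toy hidden size of 2, projected to size 3.
    states = [[1.0, 2.0], [0.0, 1.0]]
    weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    bias = [0.0, 0.0, 0.5]
    print(project_queries(states, weight, bias))
```

The number of queries is unchanged by the projection; only the vector width changes, which is what lets a fixed decoder consume the set without per-object prompting.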
Alongside the model, the team released Inst2Seg, a large-scale dataset pairing free-form instructions with instance-level masks. The benchmark covers complex instruction-driven and phrase-level referring segmentation tasks, scenarios where earlier methods often returned incomplete or duplicate masks. Experiments show the 2B-scale InstructSAM outperforms both prior end-to-end methods and SAM3's agentic pipeline on these benchmarks. The framework's efficiency advantage is clearest in multi-instance queries: asking for "all red cars and pedestrians near the crosswalk" returns a complete mask set in one forward pass, where an agentic pipeline would require multiple LLM calls and SAM invocations.
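To make the "duplicate masks" failure mode concrete, the sketch below flags pairs of predicted instance masks that overlap heavily, using intersection-over-union (IoU). It is not the evaluation protocol used for Inst2Seg or any benchmark named here; the function names and the 0.8 threshold are assumptions chosen for illustration.

```python
def mask_iou(a, b):
    """IoU of two binary masks given as equal-sized 2D lists of 0/1."""
    pairs = [(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    inter = sum(1 for x, y in pairs if x and y)
    union = sum(1 for x, y in pairs if x or y)
    return inter / union if union else 0.0


def find_duplicates(masks, threshold=0.8):
    """Return index pairs (i, j), i < j, whose IoU meets the threshold."""
    duplicates = []
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            if mask_iou(masks[i], masks[j]) >= threshold:
                duplicates.append((i, j))
    return duplicates


if __name__ == "__main__":
    top = [[1, 1], [0, 0]]
    top_again = [[1, 1], [0, 0]]   # same region predicted twice
    bottom = [[0, 0], [1, 1]]      # a distinct instance
    print(find_duplicates([top, top_again, bottom]))
```

A check like this only detects duplicates after the fact; the article's claim is that query-level interaction reduces them at prediction time.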


