MAOAM unifies object and material selection via text or click
A new vision-language model framework segments images by object or material from either text descriptions or click interactions, and is trained on real and synthetic images with VLM-generated material descriptions.

MAOAM (Mask Any Object And Material) is a unified selection framework that handles both object-level and material-level segmentation through text or click-based prompts. The system pairs a vision-language model with a segmentation head to interpret user intent, such as "the wooden chair" or "all metal surfaces", and outputs pixel-accurate masks. Material-based selection is valuable for re-texturing surfaces or editing every instance of a specific material, but existing VLM selection tools are object-centric and typically support only one interaction mode.
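To make the two interaction modes concrete, here is a minimal sketch of how a selection request might be represented. The `SelectionPrompt` class and its fields are hypothetical names chosen for illustration; the preprint does not describe MAOAM's actual input interface.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SelectionPrompt:
    """Hypothetical container for one selection request (not MAOAM's API)."""

    text: Optional[str] = None               # e.g. "all metal surfaces"
    click: Optional[Tuple[int, int]] = None  # assumed (x, y) pixel coordinates

    def __post_init__(self) -> None:
        # A request needs at least one modality to express intent.
        if self.text is None and self.click is None:
            raise ValueError("provide text, a click, or both")

    def modes(self) -> Tuple[str, ...]:
        """Return the prompt modalities present, in a fixed order."""
        present = []
        if self.text is not None:
            present.append("text")
        if self.click is not None:
            present.append("click")
        return tuple(present)


# Text-only, click-only, and combined requests share one type.
text_only = SelectionPrompt(text="the wooden chair")
click_only = SelectionPrompt(click=(120, 84))
combined = SelectionPrompt(text="only the rough metal parts", click=(120, 84))
```

Keeping both fields optional in a single type is one way to let the same entry point serve text, click, and combined prompts.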
The authors tackled the material-selection training data gap with a scalable pipeline: collecting real and synthetic images with material masks, then using VLMs to generate rich material descriptions. MAOAM trains on a multi-task objective covering click and text-based selection, plus an auxiliary visual question answering task to deepen material understanding. A key finding: although the model is trained only on uni-modal prompts, its selections improve when text and clicks are combined at inference. Users can click a surface and add "only the rough metal parts" to refine the selection in a single workflow.
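One common way to combine several training tasks is a weighted sum of per-task losses. The sketch below illustrates that idea for the three tasks named above. The weighted-sum form, the weight values, and the names `TaskWeights` and `multitask_loss` are assumptions for illustration, not details taken from the preprint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskWeights:
    """Hypothetical per-task weights; the values are placeholders."""

    click: float = 1.0
    text: float = 1.0
    vqa: float = 0.5  # auxiliary task, down-weighted here by assumption


def multitask_loss(
    click_loss: float,
    text_loss: float,
    vqa_loss: float,
    weights: TaskWeights = TaskWeights(),
) -> float:
    """Combine per-task losses into one scalar via a weighted sum."""
    return (
        weights.click * click_loss
        + weights.text * text_loss
        + weights.vqa * vqa_loss
    )


# Example with made-up loss values: 1.0*1.0 + 1.0*2.0 + 0.5*4.0
total = multitask_loss(click_loss=1.0, text_loss=2.0, vqa_loss=4.0)
```

In a real training loop the three inputs would be tensors produced by the segmentation and language heads. Plain floats are used here so the arithmetic is easy to follow.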
The preprint, by Jaden Park, Valentin Deschaintre, Jason Kuen, Kangning Liu, Iliyan Georgiev, and Krishna Kumar Singh, reports accurate selections across diverse objects, materials, and interaction scenarios, with experiments showing robust performance in practical editing tasks.



