mm-ctx indexes images, video, audio, and PDFs for LLM agents in 60ms

A new command-line tool converts images, video, audio, and PDFs into LLM-readable context, with sub-100ms metadata queries and automatic file indexing.

May 17, 2026

mm-ctx is a multimodal file indexer that turns images, video, audio, and PDFs into text context for language model agents. The tool ships with a Rust core and a Python wrapper. It indexes files automatically on first use and exposes a Unix-style command interface that agents use to query and describe media.

The tool handles image captioning, object detection, video keyframe summarization, PDF text extraction, and audio transcription through a single mm cat command. Agents can search files with mm find, count tokens for context budgets with mm wc, and run SQL queries against file metadata with mm sql. Metadata operations run in roughly 60 milliseconds across 700-file collections.
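To make the context-budget idea concrete, here is a minimal sketch of how an agent might use per-file token counts to decide which files fit in a prompt. It does not call mm-ctx: the function name pack_into_budget, the file names, and the token counts are hypothetical, standing in for whatever counts a tool such as mm wc reports.

```python
def pack_into_budget(token_counts: dict[str, int], budget: int) -> list[str]:
    """Greedily pick the cheapest files until the token budget is used up.

    `token_counts` maps a file path to its token count (hypothetical values;
    in practice these would come from a token-counting tool).
    """
    chosen: list[str] = []
    remaining = budget
    # Sort by token count, then by path, so the result is deterministic.
    for path, tokens in sorted(token_counts.items(), key=lambda kv: (kv[1], kv[0])):
        if tokens > remaining:
            break  # every later file is at least this expensive
        chosen.append(path)
        remaining -= tokens
    return chosen


if __name__ == "__main__":
    counts = {"report.pdf": 1200, "photo.png": 300, "clip.mp4": 900}
    print(pack_into_budget(counts, budget=1500))
```

Taking the smallest files first maximizes the number of files included, not their relevance; an agent that ranks files by importance would sort on that ranking instead.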

Installation and integration

mm-ctx indexes media on disk into a queryable database at first launch. The Rust core handles file I/O and metadata extraction; the Python layer wires in vision and audio models. Install via pip install mm-ctx or uv pip install mm-ctx. The tool also ships as a skill module for other agent frameworks through npx skills add vlm-run/skills@mm-cli-skill.
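The general pattern of indexing files into a queryable database can be sketched with Python's standard library alone. The example below is an illustration under stated assumptions, not mm-ctx's implementation: the files table, its columns, and the helper names build_index and count_by_kind are invented here, and the real tool's schema may look entirely different.

```python
import mimetypes
import os
import sqlite3


def build_index(root: str, db_path: str = ":memory:") -> sqlite3.Connection:
    """Walk `root` and record basic file metadata in a SQLite table.

    The schema is a hypothetical one chosen for this sketch.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, kind TEXT, size_bytes INTEGER, mtime REAL)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mime, _encoding = mimetypes.guess_type(path)
            if mime == "application/pdf":
                kind = "pdf"
            elif mime:
                kind = mime.split("/")[0]  # e.g. "image", "video", "audio"
            else:
                kind = "unknown"
            stat = os.stat(path)
            conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                (path, kind, stat.st_size, stat.st_mtime),
            )
    conn.commit()
    return conn


def count_by_kind(conn: sqlite3.Connection) -> dict[str, int]:
    """Run a plain SQL aggregate over the metadata table."""
    rows = conn.execute("SELECT kind, COUNT(*) FROM files GROUP BY kind")
    return dict(rows.fetchall())


if __name__ == "__main__":
    index = build_index(".")
    print(count_by_kind(index))
    index.close()
```

Because the metadata sits in an ordinary table, questions like "how many videos are there" or "which PDFs are larger than 10 MB" become single SQL queries that never reopen the media files, which is one plausible reason metadata lookups can stay fast.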

The project targets agentic workflows where a language model needs to reason over local media collections (photo libraries, video archives, document folders) without anyone manually preprocessing every file into text. Use cases range from autonomous document review to multimodal retrieval-augmented generation pipelines.