GridProbe cuts long-video VLM compute by 3.36× with adaptive frame selection
A new posterior-probing method selects question-relevant frames on the fly, matching monolithic VLM accuracy on Video-MME-v2 while slashing TFLOPs by more than three-fold.

GridProbe is a training-free inference technique that tackles the quadratic attention cost of processing thousands of frames in long-video vision-language models. Instead of running a single forward pass over every frame, the method arranges frames in a K×K grid and runs lightweight row and column probes that score each frame's relevance to the query in answer space. The outer product of those probe scores yields an interpretable importance map, which drives a closed-form selection rule that adapts the frame budget per question—no training, no auxiliary encoders.
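The grid-and-outer-product step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `importance_map` and `select_frames` are hypothetical, the probe scores are assumed to arrive as plain non-negative arrays (the article does not say how the VLM produces them), frames are assumed to be laid out row-major in the grid, and a fixed top-k cut stands in for the closed-form selection rule, which the article does not spell out.

```python
import numpy as np


def importance_map(row_scores, col_scores):
    """Combine row and column probe scores into a K x K importance map.

    Assumption: scores are non-negative relevance values, one per grid
    row and one per grid column. The map is normalised to sum to 1.
    """
    row = np.asarray(row_scores, dtype=float)
    col = np.asarray(col_scores, dtype=float)
    grid = np.outer(row, col)  # grid[i, j] = row[i] * col[j]
    total = grid.sum()
    if total <= 0:
        # No usable signal: fall back to a uniform map.
        return np.full(grid.shape, 1.0 / grid.size)
    return grid / total


def select_frames(grid, budget):
    """Return the flat indices of the `budget` highest-scoring cells.

    Assumption: frame t sits at cell (t // K, t % K), so a flat index is
    also a frame index. Indices are returned in temporal order.
    """
    flat = np.asarray(grid, dtype=float).ravel()
    budget = max(1, min(int(budget), flat.size))
    top = np.argpartition(flat, -budget)[-budget:]
    return np.sort(top)


if __name__ == "__main__":
    grid = importance_map([0.1, 0.7, 0.2], [0.6, 0.3, 0.1])
    print(select_frames(grid, budget=3))
```

In this toy 3×3 case the second row and first column carry most of the probe mass, so the selected frames cluster around that row. Only the selected frames would then be passed to the QA model.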
On Video-MME-v2, GridProbe matches the monolithic baseline within 1.6 percentage points of average accuracy while cutting TFLOPs by 3.36×. On LongVideoBench it Pareto-dominates the baseline, delivering +0.9 pp at 0.35× the compute. Because the selector and QA models are decoupled, pairing a small 2B selector with a stronger 4B or 8B QA model beats the 2B monolithic baseline by up to 4.0 pp at 0.52× compute—no retraining required.
What stands out
- 01 Adaptive frame budgets track question difficulty. GridProbe's Shape-Adaptive Selection rule uses the skewness and kurtosis of the importance map to set a per-question effective frame count, which correlates with intrinsic query complexity without seeing the answer, a sign of test-time adaptive compute.
- 02 Posterior probing beats contrastive pre-training signals. Training-free selectors that rely on auxiliary encoder-space similarities fail on reasoning-heavy queries (negation, cross-frame counting, holistic summarization). GridProbe scores frames in answer space using the VLM's own reasoning, sidestepping those failure modes.
- 03 Decoupled selector + QA is Pareto-dominant. A 2B selector paired with a 4B or 8B QA model strictly outperforms the 2B monolithic baseline (up to +4.0 pp on average at roughly half the compute) because the selector and QA forward passes are independent.
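To make the idea of a shape-driven frame budget concrete, the sketch below computes the skewness and excess kurtosis of an importance map and turns them into a frame count. The article does not give the actual Shape-Adaptive Selection formula, so everything in `adaptive_budget` is a hypothetical stand-in: the "peakedness" score, the direction of the mapping (a more peaked map gets fewer frames), and the `min_frames` and `max_frames` parameters are all assumptions made for illustration.

```python
import numpy as np


def shape_stats(grid):
    """Return (skewness, excess kurtosis) of the importance-map values."""
    w = np.asarray(grid, dtype=float).ravel()
    sigma = w.std()
    if sigma == 0:
        # A perfectly flat map has no shape to measure.
        return 0.0, 0.0
    z = (w - w.mean()) / sigma
    skew = float(np.mean(z ** 3))
    excess_kurt = float(np.mean(z ** 4) - 3.0)
    return skew, excess_kurt


def adaptive_budget(grid, min_frames, max_frames):
    """HYPOTHETICAL rule: peaked maps get fewer frames, flat maps get more.

    This is not the paper's closed-form rule; it only shows how two
    shape statistics could drive a per-question budget.
    """
    skew, kurt = shape_stats(grid)
    peakedness = max(skew, 0.0) + max(kurt, 0.0)
    frac = 1.0 / (1.0 + peakedness)  # 1.0 for a flat map, near 0 when peaked
    return int(round(min_frames + frac * (max_frames - min_frames)))


if __name__ == "__main__":
    flat = np.full((4, 4), 0.0625)   # relevance spread evenly over 16 frames
    peaked = np.zeros((4, 4))
    peaked[1, 2] = 1.0               # relevance concentrated in one frame
    print(adaptive_budget(flat, 4, 16), adaptive_budget(peaked, 4, 16))
```

Under these assumptions a question whose evidence sits in one short moment is answered from a handful of frames, while a question with diffuse evidence keeps the full budget. No answer label is needed to make that call.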