Gemini 3.1 Pro hits 71% on nested curve reasoning; open Qwen3-VL-8B lags at 33%
New benchmark shows frontier models struggle to parse containment hierarchies from simple nested shapes — Gemini 3.1 Pro hits 71% on easy cases, 19% on hard.
CurveBench, a new benchmark from researchers Amirreza Mohseni, Mona Mohammadi, Morteza Saghafian, and Naser Talebizadeh Saradari, tests whether vision models can read containment hierarchies from nested Jordan curves. The benchmark consists of 756 images showing non-intersecting loops — polygons, topographic contours, maze-like tangles, or dense overlapping shapes — and requires models to output a rooted tree encoding which regions sit inside which others. The task looks trivial to a human eye but trips up current multimodal systems.
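The required output is worth spelling out. Because the loops never intersect, one curve contains another exactly when it contains any single point of the other, so with access to vector geometry the tree is almost trivial to recover. The sketch below illustrates the idea, assuming each curve were given as a polygon of (x, y) vertices; the benchmark itself supplies only rendered images, which is precisely what makes the task hard for vision models. All function names here are hypothetical, not CurveBench's code.

```python
# Recovering a containment tree from non-intersecting closed curves,
# assuming each curve is a polygon given as a list of (x, y) vertices.

def point_in_polygon(pt, poly):
    """Ray-casting test: does pt fall strictly inside the polygon?"""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Count crossings of a horizontal ray extending right from pt.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

def polygon_area(poly):
    """Absolute shoelace area, used to pick the *smallest* container."""
    a = 0.0
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        a += x1 * y2 - x2 * y1
    return abs(a) / 2.0

def containment_tree(curves):
    """Map each curve index to its immediate parent (None = top level).

    Since the curves never intersect, curve j contains curve i iff any
    one vertex of i lies inside j, and the immediate parent is the
    containing curve with the smallest area.
    """
    parents = {}
    for i, ci in enumerate(curves):
        containers = [
            j for j, cj in enumerate(curves)
            if j != i and point_in_polygon(ci[0], cj)
        ]
        parents[i] = min(containers, key=lambda j: polygon_area(curves[j]),
                         default=None)
    return parents

# Two squares nested inside a larger one: 2 sits inside 1 inside 0.
curves = [
    [(0, 0), (10, 0), (10, 10), (0, 10)],
    [(1, 1), (8, 1), (8, 8), (1, 8)],
    [(2, 2), (6, 2), (6, 6), (2, 6)],
]
print(containment_tree(curves))  # {0: None, 1: 0, 2: 1}
```

The gap between this fifty-line script and the benchmark scores below is the point: the geometry is easy once the curves are localized, so the failure is perceptual.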
Gemini 3.1 Pro, the strongest closed model tested, achieved 71.1% tree-generation accuracy on CurveBench-Easy and 19.1% on CurveBench-Hard. That 52-point drop suggests the models are pattern-matching surface features rather than reasoning about spatial relationships. GPT-5.4 and Claude Opus 4.5 performed worse under the same evaluation protocol.
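A note on the metric: "tree-generation accuracy" presumably scores a prediction as correct only when the entire tree is right, and since the regions in an image carry no labels, equality should hold up to node renaming and sibling order. One plausible way to implement that is the classic AHU canonical encoding, sketched below; this is an illustration of the idea, not the paper's published evaluation code.

```python
# Hypothetical exact-match tree scoring. Trees are dicts mapping a node
# to its list of children; two trees count as equal when they have the
# same shape, regardless of node names or the order children are listed.

def canonical(tree, node):
    """Label-free canonical string for the subtree rooted at `node`;
    sibling encodings are sorted so names and order cannot matter."""
    kids = sorted(canonical(tree, child) for child in tree.get(node, []))
    return "(" + "".join(kids) + ")"

def tree_accuracy(predictions, references, root="root"):
    """Fraction of examples whose predicted tree exactly matches the
    reference up to sibling order and node renaming."""
    hits = sum(
        canonical(p, root) == canonical(r, root)
        for p, r in zip(predictions, references)
    )
    return hits / len(references)

gold = {"root": ["A", "B"], "A": ["C"]}   # A and B at top level, C inside A
same = {"root": ["y", "x"], "x": ["z"]}   # identical shape, different names
flat = {"root": ["A", "B", "C"]}          # wrong: nothing is nested
print(tree_accuracy([same, flat], [gold, gold]))  # 0.5
```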
When the authors applied RLVR-style fine-tuning (reinforcement learning with verifiable rewards) to Qwen3-VL-8B, an open-weight vision-language model, the trained checkpoint improved from 2.8% to 33.3% tree-generation accuracy on CurveBench-Easy, outperforming both GPT-5.4 and Claude Opus 4.5 on that split. The 30-point gain shows the benchmark is learnable given the right training signal, but one-third accuracy on the easiest split underscores how far open models lag on exact spatial reasoning: the fine-tuned Qwen3-VL-8B still trails Gemini 3.1 Pro by 38 points on the easy set. The preprint (arXiv:2605.14068) was posted May 15, 2026.
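RLVR fits this task because the target is machine-checkable: a trainer can parse the model's emitted tree and compare it against ground truth without a learned reward model. Below is a minimal sketch of such a binary reward, assuming the model serializes its tree as a balanced nested-parenthesis string like "((())())"; the paper's actual output format, parser, and any partial-credit shaping are not described in this article.

```python
# A hypothetical RLVR-style reward: 1.0 for an exact structural match
# against the ground-truth tree, 0.0 otherwise (including malformed output).

def is_balanced(s):
    """Accept only well-formed parenthesis strings."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        else:
            return False  # stray token: malformed output
    return depth == 0

def canonicalize(s):
    """Recursively sort sibling subtrees so the reward is invariant to
    the order in which the model lists children."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(s):
        if ch == "(":
            if depth == 0:
                start = i
            depth += 1
        else:
            depth -= 1
            if depth == 0:
                parts.append("(" + canonicalize(s[start + 1:i]) + ")")
    return "".join(sorted(parts))

def tree_reward(model_output, gold):
    """Binary verifiable reward: 1.0 iff the generated tree matches the
    ground truth exactly, up to sibling order."""
    s = model_output.strip()
    if not is_balanced(s):
        return 0.0  # malformed generations earn nothing
    return float(canonicalize(s) == canonicalize(gold))

gold = "((())())"                     # two regions in the root, one nested deeper
print(tree_reward("(()(()))", gold))  # 1.0: same tree, siblings swapped
print(tree_reward("(()())", gold))    # 0.0: nesting is wrong
```

An all-or-nothing reward like this is sparse, which may help explain why the trained checkpoint plateaus at one-third accuracy rather than climbing further.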
