PixVerve-95K: 95,000 images enable native 100-megapixel text-to-image generation
Researchers released PixVerve-95K, a 95,000-image ultra-high-resolution dataset and training framework that extends foundation models to native 100-megapixel generation.
PixVerve-95K is an open-source ultra-high-resolution text-to-image dataset containing 95,000 images, each with a minimum pixel count of 100 megapixels. The dataset includes seven-dimensional annotations and covers diverse visual scenarios, addressing the scarcity of training material for image generation beyond standard 1K and 2K resolutions. Researchers led by Haojun Chen published the work as a preprint on May 20, 2026.
Most text-to-image models in wide use today, including Stable Diffusion XL, FLUX, and Midjourney, train and generate natively at resolutions between 1024×1024 and 2048×2048 pixels. Users who want larger output typically rely on upscaling passes or tiled generation, which can introduce artifacts or lose global coherence. Native ultra-high-resolution generation has remained out of reach because high-quality training data at that scale is rare and expensive to curate. PixVerve-95K tackles that bottleneck with a custom data pipeline built to handle the complexity of 100-megapixel content.
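To put that gap in numbers, the short Python sketch below compares the pixel counts of 1024×1024 and 2048×2048 images with the 100-megapixel floor. It is an illustration only, not part of the PixVerve-95K pipeline: the helper names are hypothetical, and it assumes one megapixel means exactly 1,000,000 pixels.

```python
# Illustrative arithmetic only; not code from the PixVerve-95K release.
# Assumption: one megapixel = 1,000,000 pixels.
MEGAPIXEL = 1_000_000
UHR_THRESHOLD_MP = 100  # per-image minimum stated for PixVerve-95K


def megapixels(width: int, height: int) -> float:
    """Return the pixel count of a width x height image in megapixels."""
    return width * height / MEGAPIXEL


def meets_uhr_threshold(width: int, height: int,
                        threshold_mp: float = UHR_THRESHOLD_MP) -> bool:
    """True if the image has at least threshold_mp megapixels."""
    return megapixels(width, height) >= threshold_mp


def scale_factor(width: int, height: int,
                 threshold_mp: float = UHR_THRESHOLD_MP) -> float:
    """How many times more pixels a threshold-sized image has than this one."""
    return threshold_mp / megapixels(width, height)


if __name__ == "__main__":
    for side in (1024, 2048):
        print(f"{side}x{side}: {megapixels(side, side):.2f} MP, "
              f"100 MP is {scale_factor(side, side):.1f}x larger")
    # A square image needs sides of 10,000 pixels to reach 100 MP exactly.
    print("10000x10000 meets threshold:", meets_uhr_threshold(10_000, 10_000))
```

Under these assumptions, a 100-megapixel image holds roughly 24 times as many pixels as a 2048×2048 image and roughly 95 times as many as a 1024×1024 one.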
The researchers used the dataset to develop three training schemes that extend existing text-to-image foundation models to native 100-megapixel generation. They also introduced PixVerve-Bench, an evaluation protocol that combines conventional image-quality metrics with multimodal large-language-model assessments to measure both visual fidelity and semantic alignment at ultra-high resolution. The benchmark captures whether generated images maintain detail and compositional integrity at roughly 24 to 95 times the pixel count of today's 2048×2048 and 1024×1024 outputs. Because every image meets the 100-megapixel threshold, PixVerve-95K is the largest publicly available collection purpose-built for training native ultra-high-resolution (UHR) generation.
