Ideogram 4 open-weights: 9.3B DiT with Qwen vision encoder, native 2K
Ideogram released its first open-weight text-to-image model, a 9.3B-parameter single-stream Diffusion Transformer trained from scratch with multilingual text rendering, JSON-structured prompts, and explicit color control.
Ideogram 4, a 9.3-billion-parameter text-to-image model, is now open-weight. Released this week, it is Ideogram's first open-weight release and the first model the company trained from scratch rather than fine-tuned from an existing base. The weights ship in nf4 (CUDA) and fp8 formats, with additional quantizations promised. At 9.3B parameters, Ideogram 4 is substantially smaller than Qwen-Image (20B) or FLUX.2 dev (32B), small enough to run on consumer GPUs.
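To put the parameter counts in perspective, here is a back-of-the-envelope sketch of weight storage alone. It assumes 4 bits per parameter for nf4 and 8 bits for fp8, and adds bf16 as a comparison baseline that the release does not mention. It ignores quantization metadata, the text encoder, the VAE, and activation memory, so real VRAM use will be higher. Applying the same formats to Qwen-Image and FLUX.2 dev is purely for comparison and says nothing about what those projects actually ship.

```python
# Rough weight-storage estimate: parameters x bits per parameter.
# Parameter counts come from the article; bits-per-parameter values are
# nominal and ignore quantization metadata (scales, block constants).
PARAMS = {"Ideogram 4": 9.3e9, "Qwen-Image": 20e9, "FLUX.2 dev": 32e9}
BITS_PER_PARAM = {"nf4": 4, "fp8": 8, "bf16": 16}  # bf16 is an added baseline


def weight_gb(n_params, bits_per_param):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for name, n_params in PARAMS.items():
    sizes = ", ".join(
        f"{fmt}: {weight_gb(n_params, bits):.2f} GB"
        for fmt, bits in BITS_PER_PARAM.items()
    )
    print(f"{name} -> {sizes}")
```

Under these assumptions the Ideogram 4 transformer weights come to about 4.65 GB in nf4 and about 9.3 GB in fp8, before counting the Qwen3-VL encoder.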
The architecture is a fully single-stream Diffusion Transformer with 34 layers. Text and image tokens are concatenated into a unified sequence and processed through the same transformer, with no separate branches. Instead of a text-only encoder like CLIP or T5, Ideogram 4 uses Qwen3-VL-8B-Instruct, a full vision-language model that provides richer understanding of visual concepts. The model was trained on JSON-structured prompt annotations and includes a built-in prompt enhancer and prompt guide.
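The single-stream idea can be illustrated with a toy NumPy sketch. This is not Ideogram's code: the hidden size, token counts, text-first ordering, and the bare single-head attention (no normalization, MLP, positional encoding, or timestep conditioning) are all assumptions chosen to keep the example small. It only shows the key property described above, that text and image tokens share one sequence and one set of weights.

```python
import numpy as np


def joint_self_attention(seq, w_q, w_k, w_v):
    """Single-head self-attention over one combined token sequence."""
    q, k, v = seq @ w_q, seq @ w_k, seq @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def single_stream_block(text_tokens, image_tokens, w_q, w_k, w_v):
    """Concatenate text and image tokens, then apply ONE shared attention layer.

    Both modalities use the same projection weights; there is no separate
    text branch or image branch. Text-first ordering is an assumption.
    """
    seq = np.concatenate([text_tokens, image_tokens], axis=0)
    out = seq + joint_self_attention(seq, w_q, w_k, w_v)  # residual connection
    n_text = text_tokens.shape[0]
    return out[:n_text], out[n_text:]


# Toy sizes (hypothetical): 5 text tokens, 16 image patch tokens, width 8.
rng = np.random.default_rng(0)
d = 8
text = rng.normal(size=(5, d))
image = rng.normal(size=(16, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

text_out, image_out = single_stream_block(text, image, w_q, w_k, w_v)
print(text_out.shape, image_out.shape)
```

A dual-stream design would instead keep separate weights for each modality and only mix them inside attention; the single-stream layout trades that specialization for a simpler, uniform stack.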
What stands out
- Native 2K resolution and extreme aspect ratios. The model generates images at 2048px natively and supports aspect ratios up to 6:1, wider than most open-weight competitors.
- Multilingual text rendering. Ideogram 4 ships with what the team calls "best-in-class" multilingual text rendering, a capability that has historically been weak in open models.
- Explicit color palette control. The JSON prompt interface allows direct specification of color palettes, giving users fine-grained control over output aesthetics.
- Vision-language encoder. Using Qwen3-VL-8B-Instruct as the text encoder instead of a text-only model is a structural departure from FLUX, SDXL, and most other open DiTs.
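To make the JSON-prompt and color-palette points concrete, here is a minimal sketch of how such a prompt might be assembled and sanity-checked. The schema is entirely hypothetical: the field names (`description`, `text`, `color_palette`, `width`, `height`) and the hex color format are illustrative assumptions, not Ideogram's documented interface. Only the 6:1 aspect-ratio ceiling comes from the article.

```python
import json
import re

HEX_COLOR = re.compile(r"^#[0-9a-fA-F]{6}$")
MAX_ASPECT = 6.0  # from the article: aspect ratios up to 6:1


def build_prompt(description, text_to_render, palette, width, height):
    """Assemble a JSON prompt string. All field names are hypothetical."""
    bad = [c for c in palette if not HEX_COLOR.match(c)]
    if bad:
        raise ValueError(f"not 6-digit hex colors: {bad}")
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    aspect = max(width, height) / min(width, height)
    if aspect > MAX_ASPECT:
        raise ValueError(f"aspect ratio {aspect:.2f}:1 exceeds 6:1")
    prompt = {
        "description": description,
        "text": text_to_render,
        "color_palette": palette,
        "width": width,
        "height": height,
    }
    # ensure_ascii=False keeps non-Latin text readable in the output.
    return json.dumps(prompt, ensure_ascii=False, indent=2)


print(build_prompt("Poster for a night market", "夜市",
                   ["#0B1D3A", "#F2A900"], 2048, 1024))
```

Consult the model's own prompt guide for the real field names before using anything like this in practice.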




