LTX 2.3 speed test: H100 and RTX 5090 with CPU offload beat A100 on both timing and quality
A practitioner shared timing and quality notes from testing LTX 2.3 video synthesis on five GPU configs, finding H100 and 5090 with CPU offload deliver the best results while A100 underperformed on visual realism.
LTX 2.3, Lightricks' open-weight video diffusion model, runs fastest on H100 and RTX 5090 hardware according to benchmark data shared this week by a user who tested the model across five GPU configurations at 704×1280 resolution. The tests covered five-second and twenty-second clips in FP8 quantization, bfloat16, and CPU-offload modes.
The H100 delivered a five-second distilled clip in 45 seconds at full precision and 48 seconds in FP8, making it the speed champion. A twenty-second clip at 481 frames and 28 steps took just under 390 seconds. The RTX 5090 matched the H100 on short clips, completing a five-second distilled FP8 render in 43 seconds, but hit out-of-memory failures on twenty-second 704×1280 runs unless the resolution dropped to 576×1024 or CPU offload kicked in. With offload enabled, the 5090 rendered a twenty-second clip in 299 seconds, well ahead of the 608 seconds the A100 needed on a serverless instance.
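As a rough sanity check, the reported twenty-second timings can be converted into per-frame throughput. The sketch below uses only figures quoted above; the 481-frame count is stated for the H100 run, and the same length is assumed for the other two configurations:

```python
# Per-frame throughput derived from the reported twenty-second timings:
# H100 ~390 s, RTX 5090 with CPU offload 299 s, serverless A100 608 s.
# Assumes all three runs rendered the same 481-frame clip.
timings_s = {
    "H100": 390,
    "RTX 5090 (CPU offload)": 299,
    "A100 (serverless)": 608,
}
frames = 481

for gpu, seconds in timings_s.items():
    fps = frames / seconds
    print(f"{gpu}: {fps:.2f} frames/s")
```

Even with the offload round-trips, the 5090 comes out ahead of the A100 by roughly a factor of two on this clip length.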
Performance and quality tradeoffs
The A100 lagged on both speed and output quality. Twenty identical prompts run on A100 and H100 from the same cloud host consistently produced less realistic scenes on the A100, though the cause remains unclear. The tester also flagged the 5090's distilled FP8 output as visually inferior to output from its CPU-offload mode, suggesting memory pressure degrades sample quality even when renders complete.
The L40 proved the slowest option, taking 197 seconds for a five-second FP8 clip and 365–453 seconds for twenty-second renders in low-memory FP8 mode. The tester suspects the rented L40 instance was in poor condition, as it required batch-size-one settings to avoid crashes. For practitioners after spoken dialogue, the data suggests targeting 45–52 words per twenty-second clip and keeping critical words away from the clip's end, where the model occasionally truncates the final syllable.
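The 45–52 words per twenty seconds guideline works out to roughly 2.25–2.6 words per second, which is easy to check before rendering. A minimal sketch, with a hypothetical helper name and the band taken from the figures above:

```python
# Hypothetical helper for the dialogue pacing guideline: 45-52 words per
# twenty-second clip, i.e. about 2.25-2.6 words per second. The function
# name and default thresholds are illustrative, not part of LTX 2.3.
def dialogue_fits(script: str, clip_seconds: float,
                  min_wps: float = 45 / 20, max_wps: float = 52 / 20) -> bool:
    """Return True if the script's word rate falls inside the suggested band."""
    rate = len(script.split()) / clip_seconds
    return min_wps <= rate <= max_wps

# A 48-word script over 20 seconds sits inside the 45-52 word band.
print(dialogue_fits("word " * 48, 20.0))  # → True
```

Scripts that pass the check should still end on a non-critical word, given the reported truncation of final syllables.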
