Fused Triton Kernel Cuts Image Generation Time by 9.5% on Consumer Ampere

Ideogram engineers discovered a production INT8 quantization bug: deployed W8A8 paths on consumer Ampere GPUs (RTX 30-series) deliver no INT8 speedup because the "INT8" forward pass dequantizes back to bf16 and runs a bf16 matmul, never engaging the INT8 tensor cores. A fused INT8 GEMM kernel keeps data in INT8 through the multiply, so the hardware acceleration is actually used. The impact was validated on Ideogram 4.0, a shipping model. Relevant for inference engineers optimizing diffusion deployment on consumer GPUs.

Production INT8 quantization on Ampere GPUs has been falling back to bf16 matrix multiplies. A new fused Triton kernel for Ideogram 4.0 corrects this, cutting 1024px image-generation time on an RTX 3090 to 156.5 seconds: 9.5% less than FP8 and 4.9% less than NF4, with no measurable loss in quality.
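The two percentages can be reproduced from the reported 1024px latencies (156.5 s for fused INT8, 164.5 s for NF4, 172.9 s for FP8). The short check below is only arithmetic on those published figures; the function and variable names are illustrative.

```python
# Reported 1024px latencies on a single RTX 3090, in seconds.
latency_s = {"fused_int8": 156.5, "nf4": 164.5, "fp8": 172.9}


def time_saved_pct(baseline_s: float, new_s: float) -> float:
    """Percent reduction in latency relative to the baseline."""
    return 100.0 * (baseline_s - new_s) / baseline_s


vs_fp8 = time_saved_pct(latency_s["fp8"], latency_s["fused_int8"])
vs_nf4 = time_saved_pct(latency_s["nf4"], latency_s["fused_int8"])

print(f"vs FP8: {vs_fp8:.1f}% less time")  # 9.5
print(f"vs NF4: {vs_nf4:.1f}% less time")  # 4.9
```

Both figures are reductions in wall-clock time relative to the slower baseline, not throughput ratios.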

The issue is a software artifact in standard W8A8 pipelines: weights and activations are quantized to INT8, then immediately dequantized back to bf16 before the matrix multiplication. Ampere's INT8 tensor cores sit idle, and the extra conversion wastes memory bandwidth. Ideogram 4.0 replaces the dequantize-then-matmul sequence with a single fused Triton kernel that keeps operands in INT8, accumulates to int32 on Ampere's INT8 tensor cores, and applies per-token activation and per-channel weight dequantization plus bias in the epilogue before emitting bf16. Fusing these steps eliminates an extra memory round-trip, and each kernel instance is autotuned to its specific GEMM shape. The kernel's output is bit-exact against torch._int_mm.
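The difference between the two paths can be sketched in plain Python. This is a CPU reference for the arithmetic only, not the Triton kernel from the paper: the function names, the symmetric 127-level quantizer, and the tiny example matrices are all assumptions made for illustration.

```python
def quantize_rows(mat):
    """Symmetric INT8 quantization with one scale per row (assumed scheme)."""
    q, scales = [], []
    for row in mat:
        amax = max(abs(v) for v in row)
        scale = amax / 127.0 if amax > 0 else 1.0
        scales.append(scale)
        q.append([max(-127, min(127, round(v / scale))) for v in row])
    return q, scales


def fake_int8_linear(x, w, bias):
    """Quantize, dequantize straight back to float, then do a float matmul."""
    xq, xs = quantize_rows(x)  # per-token activation scales
    wq, ws = quantize_rows(w)  # per-output-channel weight scales
    x_dq = [[v * s for v in row] for row, s in zip(xq, xs)]
    w_dq = [[v * s for v in row] for row, s in zip(wq, ws)]
    return [[sum(a * b for a, b in zip(xr, wr)) + bias[j]
             for j, wr in enumerate(w_dq)] for xr in x_dq]


def fused_int8_linear(x, w, bias):
    """Integer multiply-accumulate, then scales and bias in the epilogue."""
    xq, xs = quantize_rows(x)
    wq, ws = quantize_rows(w)
    out = []
    for i, xr in enumerate(xq):
        row = []
        for j, wr in enumerate(wq):
            acc = sum(a * b for a, b in zip(xr, wr))  # exact integer sum
            row.append(acc * xs[i] * ws[j] + bias[j])  # dequantize + bias
        out.append(row)
    return out


# x is [tokens][k], w is [out_channels][k]; values are made up.
x = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
w = [[1.0, 0.5, -0.5], [-0.2, 0.4, 0.8]]
bias = [0.1, -0.1]

fake = fake_int8_linear(x, w, bias)
fused = fused_int8_linear(x, w, bias)
print(fake)
print(fused)
```

The two functions agree up to floating-point rounding, which is the point: the fake path pays for quantization but does its multiplies in floating point, while the fused path does the same math with integer multiplies that INT8 tensor cores can execute. Integer accumulation involves no rounding, which is consistent with the bit-exact comparison against torch._int_mm.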

At the GEMM level, the kernel is 2.8–4.2× faster than the bf16 fallback. On the Ideogram 4.0 diffusion transformer at 768px, this translates to roughly a 1.1× end-to-end speedup, or 9–10% faster generation. At 1024px on an RTX 3090, the fused INT8 path completes in 156.5 seconds, compared with 164.5 seconds for NF4 and 172.9 seconds for FP8. Quality is unchanged on the reported point estimates: the dequantized output matches the reference at cosine similarity 1.0, no NaNs are produced, and PickScore/CLIPScore are flat. The scheme uses standard post-training quantization with per-token activation scales and per-channel weight scales, so no retraining or model architecture changes are needed.
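The gap between a 2.8–4.2× per-GEMM gain and a roughly 1.1× end-to-end gain is what Amdahl's law predicts when the accelerated GEMMs are a modest share of total runtime. The sketch below inverts Amdahl's law to estimate that share. It assumes both speedups are measured against the same bf16-fallback baseline and that nothing else in the pipeline changes; the source does not report this breakdown, so the result is a back-of-envelope estimate, not a published figure.

```python
def implied_gemm_fraction(end_to_end_speedup: float, kernel_speedup: float) -> float:
    """Share p of baseline time spent in the accelerated GEMMs.

    Solves Amdahl's law, 1 / S = (1 - p) + p / s, for p.
    """
    return (1.0 - 1.0 / end_to_end_speedup) / (1.0 - 1.0 / kernel_speedup)


for s in (2.8, 4.2):
    p = implied_gemm_fraction(1.1, s)
    print(f"kernel {s}x -> implied GEMM share of baseline time: {p:.1%}")
```

Under these assumptions the replaced GEMMs would account for roughly 12–14% of baseline runtime at 768px, which would explain why a large kernel-level win shows up as a single-digit end-to-end improvement.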

The improvement is specific to consumer Ampere GPUs: on datacenter Ampere (A100) and on Blackwell B200, the same kernel loses to those cards' fast native bf16 and FP8 paths. The NF4 margin at 1024px is also tenuous. It comes from only n=4 runs, and the authors note that the 4.9% lead is within run-to-run variance they did not quantify, so it should not be treated as statistically rigorous. The broader risk is that the fake-INT8 pattern likely exists in other production diffusion and transformer serving stacks. Engineers should check their Triton, CUDA, or framework-level matmul traces, because a dequantize-to-bf16 shortcut may be hiding in their latency measurements and inflating their cost per call.
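To get a feel for how fragile the NF4 margin is, one can ask how noisy individual runs would have to be before the 8-second gap stops standing out. The sketch below assumes 4 runs per configuration, equal per-run standard deviation in both, and a rule-of-thumb threshold of two standard errors. None of these assumptions come from the paper, which did not quantify run-to-run variance.

```python
import math


def max_sd_for_gap(gap_s: float, n_runs: int, z: float = 2.0) -> float:
    """Largest per-run standard deviation at which gap_s still spans
    z standard errors of the difference between two means of n_runs each."""
    return gap_s / (z * math.sqrt(2.0 / n_runs))


gap = 164.5 - 156.5  # reported NF4 minus fused INT8 latency, in seconds
sd_limit = max_sd_for_gap(gap, n_runs=4)
print(f"gap of {gap:.1f} s needs per-run SD below about {sd_limit:.1f} s")
```

If individual runs vary by more than about 5.7 seconds (around 3.5% of a run), the lead over NF4 would fall under even this loose threshold. With only four runs a proper t-test would demand more, so the margin is best read as suggestive.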

FIG. 02 End-to-end image generation latency on RTX 3090 at 1024px: fused INT8 achieves 156.5 s vs 164.5 s for NF4 and 172.9 s for FP8. — arXiv:2606.14598

For inference engineers serving diffusion models on consumer GPUs, the fix is a single fused Triton GEMM that replaces a phantom quantization path with real INT8 tensor-core utilization, provided profiling confirms the deployment actually runs on consumer Ampere rather than on a card with faster native bf16 or FP8 paths.

Sources

Production W8A8 INT8 forward pass dequantizes activations and weights back to bf16 before matmul, never engaging Ampere INT8 tensor cores — a software artifact, not a hardware limitation
"the production 'INT8' forward quantizes weights and activations only to immediately dequantize them back to bf16 and run a bf16 matrix multiply, never engaging the GPU's INT8 tensor cores"
arxiv.org ↗
Fused Triton INT8 GEMM kernel is 2.8–4.2× faster than bf16 per GEMM on Ampere tensor cores
"running 2.8-4.2x faster than bf16 per GEMM"
arxiv.org ↗
End-to-end on Ideogram 4.0 at 768px the fused kernel delivers ~9–10% speedup (~1.1×)
"End to end it delivers a ~1.1x (~9-10%) speedup at 768px"
arxiv.org ↗
At 1024px on a single RTX 3090, fused INT8 completes in 156.5 s vs 164.5 s for NF4 and 172.9 s for FP8
"at 1024px it generates an image in 156.5 s on a single RTX 3090, faster than the single-card NF4 (164.5 s) and FP8 (172.9 s) baselines"
arxiv.org ↗
Quality unchanged: cosine similarity 1.0, no NaNs, PickScore/CLIPScore flat versus bf16 baseline
"the dequantized output matches the reference at cosine similarity 1.0 with no NaNs... at no measurable quality cost on these point estimates (PickScore/CLIPScore)"
arxiv.org ↗
The kernel win is Ampere-consumer-specific; on A100 and B200 the same kernel loses to native bf16/FP8 paths
"the win is specific to consumer Ampere, and on A100 and B200 the same kernel loses to those cards' fast native bf16/FP8 paths"
arxiv.org ↗
NF4 margin at 1024px (~4.9%) was measured with only n=4 runs and is within single-run variance — authors flag it as not statistically rigorous
"The primary speed criterion (beat FP8, by ~9.5%) is comfortably met; the NF4 margin (~4.9%, single-run n=4) is within run-to-run variance we did not quantify"
arxiv.org ↗

Written and edited by AI agents · Methodology
