Realizing Native INT8 Compute for Diffusion Transformers on Consumer GPUs: A Fused INT8 GEMM Kernel for Ideogram 4.0

Post-training INT8 (W8A8) quantization of diffusion transformers is widely deployed as a speed optimization, yet on consumer Ampere GPUs it is frequently slower than the FP8 and NF4 alternatives it is meant to beat. We trace this to a software artifact: the production "INT8" forward pass quantizes weights and activations, then immediately dequantizes them back to bf16 and runs a bf16 matrix multiply. The GPU's INT8 tensor cores are never engaged, so the hardware's compute advantage is left entirely unrealized. We close this gap with a single fused Triton INT8 GEMM (int8xint8->int32 on Ampere tensor cores, with per-token x per-channel dequantization and bias folded into the epilogue, autotuned per GEMM shape) that replaces the dequantize-to-bf16 path in the linear layers of the Ideogram 4.0 diffusion transformer. At the kernel level, the int8xint8->int32 accumulation is bit-exact against torch._int_mm, the dequantized output matches the reference at cosine similarity 1.0 with no NaNs, and each GEMM runs 2.8-4.2x faster than bf16. End to end, the kernel delivers a ~1.1x (~9-10%) speedup at 768px. At 1024px it generates an image in 156.5 s on a single RTX 3090, faster than the single-card NF4 (164.5 s) and FP8 (172.9 s) baselines, with no measurable quality cost on PickScore/CLIPScore point estimates. INT8 thus goes from the slowest variant to the fastest, and 1024px generation becomes feasible on a single GPU. The primary speed criterion (beat FP8, here by ~9.5%) is comfortably met; the NF4 margin (~4.9%, single run, n=4) is within run-to-run variance that we did not quantify and is best read as consistent with meeting the stretch target. We close with an honest deployment map: the win is specific to consumer Ampere, and on A100 and B200 the same kernel loses to those cards' fast native bf16/FP8 paths.
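To make the arithmetic concrete, the following is a minimal NumPy sketch of the two forward paths contrasted above: an int8xint8->int32 product with per-token x per-channel dequantization and bias applied afterwards, and a dequantize-then-multiply path. It is an illustration only, not the paper's Triton kernel. The function names (`quantize_rows`, `w8a8_linear`, `dequant_then_matmul`), the symmetric max-abs scaling, and the tensor shapes are assumptions introduced here, and float32 stands in for bf16.

```python
import numpy as np

def quantize_rows(a):
    """Symmetric int8 quantization with one scale per row (assumed scheme)."""
    scale = np.abs(a).max(axis=1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-8).astype(np.float32)  # guard all-zero rows
    q = np.clip(np.rint(a / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a8_linear(x, w_q, w_scale, bias):
    """Integer path: int8 x int8 -> int32, then scales and bias in the epilogue."""
    x_q, x_scale = quantize_rows(x)                      # per-token scales, (M, 1)
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T  # (M, N) int32 accumulator
    # Dequantize with per-token x per-channel scales, then add the bias.
    return acc.astype(np.float32) * x_scale * w_scale.T + bias

def dequant_then_matmul(x, w_q, w_scale, bias):
    """Dequantize-first path: operands go back to float before the matmul."""
    x_q, x_scale = quantize_rows(x)
    x_dq = x_q.astype(np.float32) * x_scale
    w_dq = w_q.astype(np.float32) * w_scale              # per-channel scales, (N, 1)
    return x_dq @ w_dq.T + bias

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))     # 4 tokens, 64 input features
w = rng.standard_normal((8, 64))     # 8 output channels
bias = rng.standard_normal(8)
w_q, w_scale = quantize_rows(w)      # weights quantized once, offline

y_int = w8a8_linear(x, w_q, w_scale, bias)
y_ref = dequant_then_matmul(x, w_q, w_scale, bias)
cosine = (y_int * y_ref).sum() / (np.linalg.norm(y_int) * np.linalg.norm(y_ref))
print("cosine similarity:", cosine)
```

Both paths compute the same quantity up to floating-point rounding, which is why the substitution need not change model outputs; the difference lies in whether the matrix multiply itself runs on integer or floating-point operands.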
