We study post-training quantization (PTQ) of Ideogram~4.0, a 9.3B-parameter flow-matching diffusion transformer (DiT) that realizes classifier-free guidance with two separate-weight copies of a single-stream backbone and is conditioned by a Qwen3-VL text encoder. Our target is Ampere RTX~3090 GPUs, which lack FP8 tensor cores. Because Ideogram~4.0 is trained on structured JSON captions, we evaluate every variant under schema-valid JSON prompts produced by an LLM expander built to Ideogram's published caption specification. We score each variant with a battery of metrics: HPSv2 (human preference), CLIP, and PickScore for standalone quality; PP-OCR exact match and edit distance for text rendering; and PSNR/SSIM/LPIPS for fidelity to the output of the FP8 reference, the highest-precision public checkpoint. On a 300-prompt benchmark with paired bootstrap confidence intervals, an INT8 W8A8 recipe (per-channel weights, per-token dynamic activations, SmoothQuant, and bf16 protection of a small set of high-fragility layers) is statistically indistinguishable from FP8 on CLIP and PickScore (paired CIs include zero) and within about 0.004 on HPSv2. At its 8-bit size it is also the most faithful reproduction of the FP8 output (LPIPS 0.243 vs. 0.277/0.306 for the half-size 4-bit baselines; the paired CI on the INT8-Q4_K gap excludes zero). A GGUF Q4_K quantization reaches the same standalone quality as the published NF4 baseline at the same on-disk size, making it the Pareto choice on the quality-memory frontier. We further show that under JSON prompts all four variants reach parity on standalone quality: the variants separate on fidelity and text rendering, not on aggregate image-quality scores. Text legibility, which is near zero when the model is prompted with raw strings, reaches 55% OCR exact match under the JSON captions the model expects. We release the INT8 W8A8 and GGUF Q4_K quantized weights on Hugging Face under a gated, non-commercial license.
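To make the W8A8 recipe named above concrete, the following is a minimal NumPy sketch of a single linear layer with SmoothQuant-style smoothing, per-output-channel INT8 weights, and per-token dynamic INT8 activations. It is an illustration under stated assumptions, not the released implementation: the function names, the migration strength `alpha = 0.5`, and the use of the demo batch itself as calibration data are all hypothetical, the arithmetic is simulated on the CPU and does not use INT8 tensor-core kernels, and the bf16 protection of fragile layers is omitted.

```python
import numpy as np

def smooth_scales(x_absmax, w, alpha=0.5, eps=1e-5):
    """SmoothQuant-style scale per input channel j:
    s_j = max|X_j|^alpha / max|W_j|^(1 - alpha). alpha=0.5 is an assumed default."""
    w_absmax = np.abs(w).max(axis=0)  # w has shape (out, in)
    s = np.maximum(x_absmax, eps) ** alpha / np.maximum(w_absmax, eps) ** (1.0 - alpha)
    return np.maximum(s, eps)

def quantize_rows_int8(t, eps=1e-8):
    """Symmetric INT8 quantization with one scale per row of t."""
    scale = np.maximum(np.abs(t).max(axis=1, keepdims=True), eps) / 127.0
    q = np.clip(np.rint(t / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a8_linear(x, w, x_absmax, alpha=0.5):
    """Simulated W8A8 matmul for x of shape (tokens, in) and w of shape (out, in)."""
    s = smooth_scales(x_absmax, w, alpha)
    # Dividing activations and multiplying weights by s leaves x @ w.T unchanged
    # in exact arithmetic, but moves outlier magnitude from activations to weights.
    xq, sx = quantize_rows_int8(x / s)  # one dynamic scale per token (row of x)
    wq, sw = quantize_rows_int8(w * s)  # one static scale per output channel (row of w)
    acc = xq.astype(np.int32) @ wq.astype(np.int32).T  # integer accumulation
    return acc.astype(np.float32) * sx * sw.T  # dequantize with both scales

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 64)).astype(np.float32)
    x[:, 3] *= 20.0  # inject an activation outlier channel
    w = rng.standard_normal((16, 64)).astype(np.float32)
    # Assumption: calibration statistics come from the same batch for brevity.
    y = w8a8_linear(x, w, np.abs(x).max(axis=0))
    ref = x @ w.T
    print("relative error:", np.linalg.norm(y - ref) / np.linalg.norm(ref))
```

In a real pipeline the per-channel activation maxima would be collected on a separate calibration set, and the smoothing scales would typically be folded into the preceding layer so that no extra division is paid at inference time.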