BiGain: Unified Token Compression for Joint Generation and Classification

Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves generation quality while improving classification in accelerated diffusion models. Our key insight is frequency separation: mapping feature-space signals into a frequency-aware representation disentangles fine detail from global semantics, enabling compression that respects both generative fidelity and discriminative utility. BiGain instantiates this principle with two frequency-aware operators: (1) Laplacian-gated token merging, which encourages merges among spectrally smooth tokens while discouraging merges of high-contrast tokens, thereby retaining edges and textures; and (2) Interpolate-Extrapolate KV Downsampling, which downsamples keys/values via a controllable interpolation or extrapolation between nearest-neighbor and average pooling while keeping queries intact, thereby conserving attention precision. Across DiT- and U-Net-based backbones on ImageNet-1K, ImageNet-100, Oxford-IIIT Pets, and COCO-2017, our operators consistently improve the speed-accuracy trade-off for diffusion-based classification, while maintaining or enhancing generation quality under comparable acceleration. For instance, on ImageNet-1K, with 70% token merging on Stable Diffusion 2.0, BiGain increases classification accuracy by 7.15% while improving FID by 0.34 (a 1.85% relative improvement). Our analyses indicate that balanced spectral retention, preserving high-frequency detail and low/mid-frequency semantics, is a reliable design rule for token compression in diffusion models. To our knowledge, BiGain is the first framework to jointly study and advance both generation and classification under accelerated diffusion, supporting lower-cost deployment.
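
The two operators can be illustrated with a small plain-Python sketch. The abstract does not give formulas, so everything concrete here is an assumption made for illustration: the linear blend `(1 - alpha) * nearest + alpha * average`, the choice of the top-left token as the "nearest" sample, the 4-neighbor Laplacian with edge replication, the L1 norm, the hard threshold gate, and all function and parameter names (`downsample_kv`, `laplacian_magnitude`, `merge_candidates`, `alpha`, `stride`, `threshold`). It is not the authors' implementation.

```python
from typing import List

Grid = List[List[List[float]]]  # H x W grid of C-dimensional token features


def downsample_kv(grid: Grid, stride: int, alpha: float) -> Grid:
    """Blend nearest and average pooling over stride x stride windows.

    Assumed parameterization: alpha = 0 gives nearest pooling (top-left token),
    alpha = 1 gives average pooling, values outside [0, 1] extrapolate.
    """
    height, width = len(grid), len(grid[0])
    channels = len(grid[0][0])
    out: Grid = []
    for top in range(0, height - stride + 1, stride):
        row = []
        for left in range(0, width - stride + 1, stride):
            window = [grid[top + dy][left + dx]
                      for dy in range(stride) for dx in range(stride)]
            nearest = window[0]
            average = [sum(tok[c] for tok in window) / len(window)
                       for c in range(channels)]
            row.append([(1.0 - alpha) * n + alpha * a
                        for n, a in zip(nearest, average)])
        out.append(row)
    return out


def laplacian_magnitude(grid: Grid) -> List[List[float]]:
    """Per-token L1 norm of a 4-neighbor Laplacian with edge replication."""
    height, width = len(grid), len(grid[0])
    channels = len(grid[0][0])

    def at(y: int, x: int) -> List[float]:
        # Clamp coordinates so border tokens reuse their own edge values.
        return grid[min(max(y, 0), height - 1)][min(max(x, 0), width - 1)]

    scores = []
    for y in range(height):
        row = []
        for x in range(width):
            neighbors = (at(y - 1, x), at(y + 1, x), at(y, x - 1), at(y, x + 1))
            row.append(sum(abs(sum(n[c] for n in neighbors) - 4.0 * grid[y][x][c])
                           for c in range(channels)))
        scores.append(row)
    return scores


def merge_candidates(grid: Grid, threshold: float) -> List[List[bool]]:
    """Mark tokens whose Laplacian response is small enough to allow merging."""
    return [[score <= threshold for score in row]
            for row in laplacian_magnitude(grid)]
```

In an attention layer, a function like `downsample_kv` would be applied only to the feature grid that produces keys and values, while queries keep their full resolution, as the abstract describes. The boolean mask from `merge_candidates` stands in for the gating step: smooth tokens remain eligible for merging, and tokens near edges or textures are held back.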
