HiLo-Token: Input-Adaptive High-Low Frequency Token Compression for Efficient Image Editing

Creative image editing tools, such as Photoshop's Remove or Generative Fill buttons, are central to everyday customer use and account for a major share of traffic in Photoshop and Lightroom. However, current generative AI models face significant latency challenges, which become even more pronounced when transitioning from convolution-based U-Nets to Diffusion Transformers (DiTs). In our evaluation on hundreds of representative image editing samples spanning a wide range of mask ratios, the DiT module alone accounts for an average of 73% of the total model latency, even after being distilled from 50 timesteps down to 8 timesteps. To tackle this challenge, we propose $\textbf{HiLo-Token}$, an input-adaptive token compression framework that allocates more token budget to high-frequency, rich-context regions while assigning fewer tokens to low-frequency areas. Specifically, for the editing region specified by the user mask, we retain all tokens within a dilated mask to preserve strong locality and contextual relevance. Outside the editing region, we introduce a simple yet effective high-frequency token selection strategy based on spatial frequency to capture important local details, while using tokens from a 16x downsampled image to represent low-frequency components and preserve the blurry but global structure. Extensive experiments on production-level evaluation data validate the effectiveness of the proposed method, achieving 3.13x, 2.59x, and 1.67x DiT speedups on A100-80GB for image editing tasks across small, medium, and large mask ratio categories with average ratios of 6.38%, 15.92%, and 35.36%, respectively, without any regression in generation quality.
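The three-way token budget described above (all tokens inside a dilated edit mask, a selected set of high-frequency tokens outside it, and a small set of tokens from a 16x downsampled image) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the 16-pixel patch size, the one-token dilation radius, the 25% keep ratio, and the use of mean gradient magnitude as the spatial-frequency score are all assumptions made for illustration, and the sketch works on a grayscale pixel array rather than on DiT latents.

```python
import numpy as np


def dilate(mask: np.ndarray, radius: int) -> np.ndarray:
    """Binary dilation with a square (2r+1)x(2r+1) window, built from shifted ORs."""
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = mask.copy()
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def select_tokens(image, edit_mask, patch=16, dilation=1, keep_ratio=0.25, down=16):
    """Hypothetical HiLo-style selection on a 2D grayscale image.

    Returns boolean token-grid masks for the edit region and the selected
    high-frequency tokens, plus the number of low-frequency tokens.
    """
    h, w = image.shape
    gh, gw = h // patch, w // patch
    ch, cw = gh * patch, gw * patch  # crop to a whole number of patches

    # 1) Edit region: a token is kept if any pixel of its patch is masked,
    #    then the token-level mask is dilated to include nearby context.
    blocks = edit_mask[:ch, :cw].astype(bool).reshape(gh, patch, gw, patch)
    keep_edit = dilate(blocks.any(axis=(1, 3)), dilation)

    # 2) Outside the edit region: score each patch by mean gradient magnitude
    #    (an assumed proxy for spatial frequency) and keep the top fraction.
    gy, gx = np.gradient(image.astype(np.float64))
    energy = np.hypot(gx, gy)[:ch, :cw].reshape(gh, patch, gw, patch).mean(axis=(1, 3))
    energy[keep_edit] = -np.inf  # tokens already kept are not candidates
    k = int(round(keep_ratio * int((~keep_edit).sum())))
    keep_hf = np.zeros((gh, gw), dtype=bool)
    if k > 0:
        top = np.argpartition(energy.ravel(), -k)[-k:]
        keep_hf.flat[top] = True

    # 3) Low-frequency tokens: patchify the image downsampled by `down`
    #    per side; only the resulting token count is needed here.
    n_low = (h // down // patch) * (w // down // patch)

    return keep_edit, keep_hf, n_low


def compression_ratio(keep_edit, keep_hf, n_low):
    """Full token count divided by the number of tokens actually kept."""
    kept = int(keep_edit.sum()) + int(keep_hf.sum()) + n_low
    return keep_edit.size / kept
```

In this sketch the kept-token count grows with the mask area, which is consistent with the smaller speedups the abstract reports for larger mask ratios; the mapping from token count to actual DiT latency is not modeled here.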
