Show the Signal, Hide the Noise: Spectral Forcing for Pixel-Space Diffusion

Pixel-space diffusion models are trained on full-bandwidth noisy images, yet the useful signal available to the denoiser is strongly frequency dependent. Under rectified-flow diffusion and natural-image power-law spectra, the per-band data-to-noise contour $k^{*}(t) = (1-t)^{-2/\alpha}$ separates a signal-bearing low-frequency region from a noise-dominated high-frequency region at each time $t$. We show that this implicit coarse-to-fine structure is not merely descriptive: it induces a capacity-allocation problem. A standard pixel-space denoiser must discover the moving bandwidth boundary internally, and it can spend computation on frequency-time regions where the optimal prediction collapses to a deterministic baseline rather than requiring any modeling of the data distribution. To make this boundary explicit, we introduce Spectral Forcing, a parameter-free, time-conditional 2D-DCT low-pass operator applied to the noisy input before the patch embedder. Its cutoff expands monotonically with diffusion time and the operator becomes the identity at the data endpoint. Through controlled synthetic experiments, we identify the regime in which the operator is beneficial: coarse patch tokenization and data whose high-frequency content is predominantly noise rather than essential signal. On ImageNet-256 with JiT-700M/32, Spectral Forcing consistently improves both FID and Inception Score across training epochs; at finer tokenization, it remains competitive. We further insert the unchanged operator into SenseNova-U1, a unified text-to-image model, where it improves DPG-Bench and GenEval, showing that the input-side spectral prior transfers beyond class-conditional generation. These results suggest a route to capacity-efficient pixel-space diffusion by showing the signal and hiding the noise.
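
To make the operator concrete, the following is a minimal NumPy sketch of a time-conditional 2D-DCT low-pass filter driven by the contour $k^{*}(t)$ stated above. It is an illustration under assumptions, not the implementation evaluated here: the function names (`dct_matrix`, `cutoff`, `spectral_lowpass`), the radial mask shape, the use of raw DCT indices as frequency units, and the default `alpha=2.0` are all hypothetical choices made for the example.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix of shape (n, n)."""
    k = np.arange(n)[:, None]          # frequency index (rows)
    i = np.arange(n)[None, :]          # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    c[0, :] /= np.sqrt(2.0)            # rescale the DC row so that c @ c.T = I
    return c


def cutoff(t, alpha=2.0):
    """Contour k*(t) = (1 - t)^(-2 / alpha); unbounded at the data endpoint t = 1.

    alpha is the power-law exponent; 2.0 is an assumed placeholder value.
    """
    if t >= 1.0:
        return np.inf
    return (1.0 - t) ** (-2.0 / alpha)


def spectral_lowpass(x, t, alpha=2.0):
    """Zero the 2D-DCT coefficients of a single-channel image x beyond k*(t).

    Assumption: a radial mask on raw DCT indices (u, v); the actual mask
    shape and frequency scaling of the operator may differ.
    """
    h, w = x.shape
    ch, cw = dct_matrix(h), dct_matrix(w)
    coeffs = ch @ x @ cw.T             # forward 2D DCT
    u = np.arange(h)[:, None]
    v = np.arange(w)[None, :]
    mask = np.sqrt(u**2 + v**2) <= cutoff(t, alpha)
    return ch.T @ (coeffs * mask) @ cw  # inverse 2D DCT (transpose of orthonormal basis)
```

In this sketch the filter would be applied per channel to the noisy input before patch embedding. Because the masks are nested as $t$ grows, the retained band only widens over time, and at $t = 1$ every coefficient is kept, so the operator reduces to the identity.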
