Show the Signal, Hide the Noise: Spectral Forcing for Pixel-Space Diffusion

Pixel-space diffusion models are trained on full-bandwidth noisy images, yet the useful signal available to the denoiser is strongly frequency dependent. Under rectified-flow diffusion and natural-image power-law spectra, the per-band data-to-noise contour $k^{*}(t) = (1-t)^{-2/\alpha}$ separates a signal-bearing low-frequency region from a noise-dominated high-frequency region at each time $t$. We show that this implicit coarse-to-fine structure is not merely descriptive: it induces a capacity-allocation problem. A standard pixel-space denoiser must discover the moving bandwidth boundary internally, and it can spend computation on frequency-time regions where the optimal prediction collapses to a deterministic baseline rather than requiring any modeling of the data distribution. To make this boundary explicit, we introduce Spectral Forcing, a parameter-free, time-conditional 2D-DCT low-pass operator applied to the noisy input before the patch embedder. Its cutoff expands monotonically with diffusion time and the operator becomes the identity at the data endpoint. Through controlled synthetic experiments, we identify the regime in which the operator is beneficial: coarse patch tokenization, and data whose high-frequency content is predominantly noise rather than essential signal. On ImageNet-256 with JiT-700M/32, Spectral Forcing consistently improves both FID and Inception Score across training epochs; at finer tokenization, it remains competitive. We further insert the unchanged operator into SenseNova-U1, a unified text-to-image model, where it improves DPG-Bench and GenEval, showing that the input-side spectral prior transfers beyond class-conditional generation. These results suggest a route to capacity-efficient pixel-space diffusion by showing the signal and hiding the noise.
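As a rough illustration of the operator described above, the following is a minimal NumPy sketch of a time-conditional 2D-DCT low-pass filter. It is not the paper's implementation. Several choices are assumptions made only for this example: the function names (`dct_matrix`, `cutoff`, `spectral_lowpass`), the value `alpha=2.0`, the radial mask in DCT index space, measuring $k$ in raw DCT index units without rescaling to the image resolution, and the convention that $t = 1$ is the data endpoint.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix C of size n x n (C @ C.T is the identity)."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # spatial index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # DC row uses a different normalization
    return c


def cutoff(t, alpha=2.0):
    """Contour k*(t) = (1 - t)^(-2/alpha); unbounded at the data endpoint t = 1.

    alpha = 2.0 is an illustrative value, not one taken from the paper.
    """
    if t >= 1.0:
        return np.inf
    return (1.0 - t) ** (-2.0 / alpha)


def spectral_lowpass(x, t, alpha=2.0):
    """Zero out 2D-DCT coefficients whose radial index exceeds k*(t).

    x has shape (..., H, W). The radial mask and the index units of k are
    assumptions of this sketch; the operator has no learned parameters.
    """
    h, w = x.shape[-2:]
    ch, cw = dct_matrix(h), dct_matrix(w)
    coeffs = ch @ x @ cw.T  # forward 2D DCT over the last two axes
    u = np.arange(h)[:, None]
    v = np.arange(w)[None, :]
    mask = np.sqrt(u ** 2 + v ** 2) <= cutoff(t, alpha)
    return ch.T @ (coeffs * mask) @ cw  # inverse 2D DCT


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy = rng.standard_normal((3, 32, 32))
    for t in (0.0, 0.5, 0.9, 1.0):
        kept = float(np.sum(spectral_lowpass(noisy, t) ** 2) / np.sum(noisy ** 2))
        print(f"t={t:.1f}  fraction of energy kept = {kept:.4f}")
```

Because the masks are nested as $t$ grows and the transform is orthonormal, the retained energy is non-decreasing in $t$, and the input passes through unchanged at $t = 1$. In a model, a filter of this kind would sit in front of the patch embedder, as the abstract describes.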
