We parameterize the weight matrices of a transformer in the two-dimensional discrete cosine transform (DCT) domain, retaining only the lowest-frequency coefficients. At each forward pass, the full weight matrix is reconstructed via the inverse DCT; gradients propagate through the reconstruction and update the spectral coefficients directly. On character-level language modeling (Shakespeare, 1M characters), a 4-layer transformer trained from scratch in this representation matches the perplexity of the standard parameterization (6.1 vs.\ 6.1) while storing only 52\% of the parameters. At 4$\times$ compression (29\% of parameters), the model reaches perplexity 6.9, outperforming a low-rank baseline (perplexity 8.8 at 21\% of parameters) at a comparable parameter budget. The method requires no architectural changes, no pre-trained checkpoint, and no auxiliary loss. In practice, it amounts to replacing each \texttt{nn.Linear} with a drop-in spectral layer that stores $K$ DCT coefficients instead of the $nm$ entries of an $n \times m$ weight matrix.
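
To make the reconstruction concrete, the following is a minimal sketch under two assumptions that the description above does not fix: an orthonormal DCT-II basis, and a rectangular $k_1 \times k_2$ block of retained low-frequency coefficients (a zigzag or triangular selection would change the index set but not the structure). The symbols $C_n$, $C_{n,k_1}$, $\hat{S}$, $k_1$, and $k_2$ are introduced here for illustration only. Let $C_n$ be the $n \times n$ matrix with entries
\[
[C_n]_{u,i} = \alpha_u \cos\!\left(\frac{\pi (2i+1) u}{2n}\right), \qquad u, i = 0, \dots, n-1,
\]
where $\alpha_0 = \sqrt{1/n}$ and $\alpha_u = \sqrt{2/n}$ for $u \geq 1$, so that $C_n C_n^{\top} = I$. Writing $C_{n,k_1}$ for the first $k_1$ rows of $C_n$ and $\hat{S}$ for the $k_1 \times k_2$ matrix of stored coefficients, the forward pass reconstructs
\[
W = C_{n,k_1}^{\top} \, \hat{S} \, C_{m,k_2},
\]
which equals the inverse 2D DCT of $\hat{S}$ zero-padded to $n \times m$. Because $W$ is linear in $\hat{S}$, the chain rule gives the gradient with respect to the stored coefficients as
\[
\frac{\partial \mathcal{L}}{\partial \hat{S}} = C_{n,k_1} \, \frac{\partial \mathcal{L}}{\partial W} \, C_{m,k_2}^{\top},
\]
that is, the low-frequency block of the forward 2D DCT of the weight gradient. Under these assumptions the layer stores $K = k_1 k_2$ values instead of $nm$; for a hypothetical $256 \times 256$ layer with $k_1 = k_2 = 128$, this is $16384$ of $65536$ weights, or $25\%$.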