Generative sequence modeling faces a fundamental tension between the expressivity of Transformers and the efficiency of linear sequence models. Existing efficient architectures are theoretically bounded by shallow, single-step linear updates, while powerful iterative methods like Test-Time Training (TTT) break hardware parallelism due to state-dependent gradients. We propose PRISM (Parallel Residual Iterative Sequence Model) to resolve this tension. PRISM introduces a solver-inspired inductive bias that captures key structural properties of multi-step refinement in a parallelizable form. We employ a Write-Forget Decoupling strategy that isolates non-linearity within the injection operator. To bypass the serial dependency of explicit solvers, PRISM uses a two-stage proxy architecture: a short convolution anchors the initial residual using local history energy, while a learned predictor estimates the refinement updates directly from the input. This design distills the structural patterns of iterative correction into a parallelizable feedforward operator. Theoretically, we prove that this formulation achieves Rank-$L$ accumulation, structurally expanding the update manifold beyond the single-step Rank-$1$ bottleneck. Empirically, PRISM matches the performance of explicit optimization methods while delivering 174x higher throughput.
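The two-stage proxy described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the dense parameterization, the parameter names (`conv_w`, `W_pred`), and the simple additive accumulation are all assumptions chosen to show the structure, namely a short causal convolution anchoring an initial residual from local history, and $L$ learned updates predicted directly from the input and summed in one parallel pass (Rank-$L$ accumulation).

```python
import numpy as np

rng = np.random.default_rng(0)

d, L, T, k = 8, 4, 16, 3  # model dim, refinement steps, sequence length, conv width

# Hypothetical parameters; names are illustrative, not from the paper.
conv_w = rng.standard_normal((k, d)) / np.sqrt(k * d)   # short causal convolution
W_pred = rng.standard_normal((L, d, d)) / np.sqrt(d)    # per-step update predictors

def prism_proxy(x):
    """Two-stage proxy (toy sketch).

    Stage 1: a short causal convolution over the last k tokens anchors
    the initial residual from local history.
    Stage 2: L refinement updates are predicted directly from the input
    and accumulated at once, with no serial solver loop.
    """
    T_, d_ = x.shape
    # Stage 1: causal short convolution (zero-padded on the left).
    pad = np.vstack([np.zeros((k - 1, d_)), x])
    r0 = np.stack([(pad[t:t + k] * conv_w).sum(axis=0) for t in range(T_)])
    # Stage 2: predict all L updates in parallel from the input alone.
    updates = np.einsum('ldm,tm->ltd', W_pred, x)  # shape (L, T, d)
    return r0 + updates.sum(axis=0)  # Rank-L accumulation, one feedforward pass

x = rng.standard_normal((T, d))
y = prism_proxy(x)
```

Because the updates depend only on the input rather than on intermediate states, all $L$ steps can be evaluated in parallel, which is what removes the serial dependency of an explicit solver.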