Toeplitz Neural Networks (TNNs) (Qin et al. 2023) are a recent class of sequence models with impressive results. For sequence length n, they require O(n log n) computation and O(n) calls to the relative positional encoder (RPE) multi-layer perceptron (MLP) and to the decay bias. We aim to reduce both. We first note that the RPE is a non-SPD (symmetric positive definite) kernel and that the Toeplitz matrices are pseudo-Gram matrices. Further, 1) the learned kernels display spiky behavior near the main diagonals and are otherwise smooth; 2) the RPE MLP is slow. For bidirectional models, this motivates a sparse plus low-rank decomposition of the Toeplitz matrix. We compute the action of the sparse component with a small 1D convolution. For the low-rank component, we replace the RPE MLP with linear interpolation and use asymmetric Structured Kernel Interpolation (SKI) (Wilson et al. 2015) to reach O(n) complexity, and we provide a rigorous error analysis. For causal models, "fast" causal masking (Katharopoulos et al. 2020) negates SKI's benefits. Working in the frequency domain instead, we avoid an explicit decay bias. To enforce causality, we use the RPE to model the real part of the kernel's frequency response and compute the imaginary part via a Hilbert transform. This keeps the O(n log n) complexity but achieves an absolute speedup. Modeling the frequency response directly is also competitive for bidirectional training and uses one fewer FFT. We set a new state of the art for speed on Long Range Arena (Tay et al. 2020) with minimal score degradation.
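The causal construction can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `causal_response_from_real_part` and `causal_toeplitz_matvec` are hypothetical, the real part is assumed to be sampled on the `rfft` grid of a length-2n transform, and the discrete Hilbert transform is applied in its equivalent time-domain form (inverting the real part to get the even part of the kernel, then zeroing negative lags).

```python
import numpy as np


def causal_response_from_real_part(real_part):
    """Recover a causal kernel and its full frequency response from Re(H).

    real_part: real array of length n + 1, the real part of the frequency
    response on the rfft grid of a length N = 2n transform (assumed layout).
    """
    n = real_part.shape[-1] - 1
    N = 2 * n
    # A purely real spectrum inverts to the (circular) even part of the kernel.
    even = np.fft.irfft(real_part, n=N)
    # Fold the even part onto non-negative lags. This windowing is the
    # time-domain equivalent of computing Im(H) by a discrete Hilbert transform.
    window = np.zeros(N)
    window[0] = 1.0
    window[1:n] = 2.0
    window[n] = 1.0
    kernel = even * window
    return np.fft.rfft(kernel), kernel


def causal_toeplitz_matvec(real_part, x):
    """Apply the lower-triangular Toeplitz matrix defined by Re(H) to x."""
    n = x.shape[0]
    response, _ = causal_response_from_real_part(real_part)
    # Zero-padding x to length 2n turns circular into linear convolution.
    y = np.fft.irfft(response * np.fft.rfft(x, n=2 * n), n=2 * n)
    return y[:n]


# Illustrative check against a dense causal Toeplitz matrix.
rng = np.random.default_rng(0)
n = 8
h = rng.standard_normal(n)            # causal kernel, lags 0..n-1
x = rng.standard_normal(n)
real_part = np.fft.rfft(h, n=2 * n).real
dense = np.array([[h[i - j] if i >= j else 0.0 for j in range(n)]
                  for i in range(n)])
max_err = np.max(np.abs(causal_toeplitz_matvec(real_part, x) - dense @ x))
```

In this sketch the cost is dominated by the FFTs, so it stays O(n log n), and no explicit mask or decay bias is needed because causality is built into the reconstructed response. In a model, `real_part` would come from the RPE rather than from a known kernel as in the check above.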