The Hierarchical Kernel Transformer (HKT) is a multi-scale attention mechanism that processes sequences at L resolution levels via trainable causal downsampling and combines the level-specific score matrices through learned convex weights. Its total computational cost is bounded by 4/3 times that of standard attention and reaches 1.3125x at L = 3. Four theoretical results are established. (i) The hierarchical score matrix defines a positive semidefinite kernel under a sufficient condition on the symmetrised bilinear form (Proposition 3.1). (ii) The asymmetric score matrix decomposes uniquely into a symmetric part controlling reciprocal attention and an antisymmetric part controlling directional attention; HKT provides L independent such pairs, one per resolution level (Propositions 3.5-3.6). (iii) The approximation error decomposes into three interpretable components, with an explicit non-Gaussian correction and a geometric decay bound in L (Theorem 4.3, Proposition 4.4). (iv) HKT strictly subsumes single-head standard attention and causal convolution (Proposition 3.4). Experiments over 3 random seeds show consistent gains over retrained standard attention baselines: +4.77pp on synthetic ListOps (55.10 ± 0.29% vs 50.33 ± 0.12%, T = 512), +1.44pp on sequential CIFAR-10 (35.45 ± 0.09% vs 34.01 ± 0.19%, T = 1,024), and +7.47pp on IMDB character-level sentiment (70.19 ± 0.57% vs 62.72 ± 0.40%, T = 1,024), all at 1.31x the cost of standard attention.
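The cost figures and the symmetric/antisymmetric split quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' implementation. It assumes that each level halves the sequence length and pays quadratic attention cost, which is one reading consistent with the stated 4/3 bound and the 1.3125x value at L = 3. The names `relative_cost` and `split_scores` are hypothetical and introduced only for illustration.

```python
from fractions import Fraction


def relative_cost(num_levels, factor=2):
    """Cost of num_levels attention levels relative to one full-resolution level.

    Assumption (not stated in the abstract): level l operates on a sequence
    of length T / factor**l and pays quadratic cost, so its share of the
    full-resolution cost is factor**(-2 * l).
    """
    return sum(Fraction(1, factor ** (2 * level)) for level in range(num_levels))


def split_scores(scores):
    """Split a square score matrix S into (S + S^T) / 2 and (S - S^T) / 2.

    The first part is symmetric, the second antisymmetric, and their sum
    recovers S exactly. This is the standard unique decomposition.
    """
    n = len(scores)
    sym = [[(scores[i][j] + scores[j][i]) / 2 for j in range(n)] for i in range(n)]
    anti = [[(scores[i][j] - scores[j][i]) / 2 for j in range(n)] for i in range(n)]
    return sym, anti


if __name__ == "__main__":
    # Relative cost grows with the number of levels but stays below 4/3.
    for levels in (1, 2, 3, 8):
        print(levels, float(relative_cost(levels)))

    # Toy 2x2 asymmetric score matrix.
    sym, anti = split_scores([[1.0, 4.0], [2.0, 3.0]])
    print(sym)
    print(anti)
```

Under this assumption the per-level costs form a geometric series with ratio 1/4, which is why the total approaches but never reaches 4/3 as L grows. In the split, the symmetric part is unchanged when query and key positions are swapped, while the antisymmetric part flips sign, matching the reciprocal versus directional reading in result (ii).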