Spatio-temporal models analyze spatial structures and temporal dynamics, which makes them prone to information degeneration across space and time. Prior literature has demonstrated that over-squashing in causal attention or temporal convolutions creates a bias toward the first tokens. To analyze whether such a bias is present in temporal attention mechanisms, we derive sensitivity bounds on the expected value of the Jacobian of a temporal attention layer. We show theoretically how off-diagonal attention scores depend on the sequence length, and that temporal attention matrices suffer from a diagonal attention sink. We suggest regularization methods and experimentally demonstrate their effectiveness.