Recency bias is a useful inductive prior for sequential modeling: it emphasizes nearby observations while still allowing longer-range dependencies. Standard Transformer attention lacks this property, relying on all-to-all interactions that overlook the causal and often local structure of temporal data. We propose a simple mechanism that introduces recency bias by reweighting attention scores with a smooth, heavy-tailed decay. This adjustment strengthens local temporal dependencies without sacrificing the flexibility to capture broader, data-specific correlations. We show that recency-biased attention consistently improves sequential modeling and aligns the Transformer more closely with the read, ignore, and write operations of RNNs. Finally, we demonstrate that our approach achieves competitive and often superior performance on challenging time-series forecasting benchmarks.
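To make the reweighting idea concrete, the following is a minimal sketch of causal attention with a recency bias. The abstract does not specify the functional form of the decay, so the power law `(1 + distance) ** (-alpha)`, the parameter name `alpha`, the function name `recency_biased_attention`, and the choice to apply the decay as an additive log-bias on the logits are all assumptions made for illustration and may differ from the actual method.

```python
import numpy as np

def recency_biased_attention(q, k, v, alpha=1.0):
    """Causal attention with a power-law recency reweighting (illustrative only).

    q, k: arrays of shape (T, d); v: array of shape (T, d_v).
    alpha: hypothetical decay exponent; alpha = 0 gives plain causal attention.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)               # (T, T) scaled dot-product logits

    idx = np.arange(T)
    dist = idx[:, None] - idx[None, :]          # query index minus key index
    causal = dist >= 0                          # keys at or before the query

    # Assumed heavy-tailed decay: weight proportional to (1 + dist)^(-alpha).
    # Adding its log to the logits multiplies the softmax weights by the
    # decay and renormalizes each row.
    log_decay = -alpha * np.log1p(np.maximum(dist, 0))

    logits = np.where(causal, scores + log_decay, -np.inf)
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Because a power law decays polynomially rather than exponentially, distant keys keep non-negligible weight, so content-based scores can still override the bias when a long-range dependency matters. With identical queries and keys, the attention weights in this sketch reduce to the normalized decay profile itself.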