Positional encoding in transformers is commonly implemented through positional embeddings, attention masks, or bias terms, but formal connections between these mechanisms remain limited. We study attention with positional bias through the lens of locality-sensitive hashing (LSH), focusing on Attention with Linear Biases (ALiBi). We show that the ALiBi bias matrix is the expectation of contiguous block-diagonal binary masks induced by a ``positional LSH'' scheme. With high probability, the empirical mean of masks sampled from this scheme approximates the bias matrix in both spectral norm and max-norm while keeping block sizes bounded. This structural theorem implies a uniform approximation theorem for ALiBi-biased attention: with high probability over the sampled masks, the approximate attention output is accurate simultaneously for all query-key-value inputs and can be computed in time near-linear in the context length. This reduces long-context ALiBi to a collection of randomized short-context regular (positionally unbiased) attention operations. Conceptually, the result connects positional bias, masks, and positional embeddings in a single formal framework and suggests an approach to efficient ALiBi-biased attention. Experiments on large language models validate our theoretical findings.
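The claim that the bias matrix is an expectation of block-diagonal binary masks can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the scheme from the paper: it assumes the symmetric (non-causal) exponentiated ALiBi kernel $\exp(-m\,|i-j|)$ with slope $m$, and it assumes a hypothetical sampling rule that cuts the sequence between each pair of adjacent positions independently with probability $1 - e^{-m}$. Under that rule, positions $i$ and $j$ share a block exactly when none of the $|i-j|$ boundaries between them is cut, which happens with probability $e^{-m|i-j|}$. The function names `alibi_kernel`, `sample_block_mask`, and `empirical_mean_mask` are illustrative only.

```python
import numpy as np


def alibi_kernel(n, slope):
    """Assumed target matrix: exp(-slope * |i - j|) for an n-token context."""
    idx = np.arange(n)
    return np.exp(-slope * np.abs(idx[:, None] - idx[None, :]))


def sample_block_mask(n, slope, rng):
    """Sample one contiguous block-diagonal binary mask.

    Hypothetical rule: each of the n - 1 boundaries between adjacent
    positions is cut independently with probability 1 - exp(-slope).
    """
    cuts = rng.random(n - 1) < 1.0 - np.exp(-slope)
    # The running count of cuts labels each position with its block index.
    block_id = np.concatenate(([0], np.cumsum(cuts)))
    # Entry (i, j) is 1 exactly when i and j fall in the same block.
    return (block_id[:, None] == block_id[None, :]).astype(float)


def empirical_mean_mask(n, slope, num_samples, seed=0):
    """Average num_samples independent masks drawn from the rule above."""
    rng = np.random.default_rng(seed)
    acc = np.zeros((n, n))
    for _ in range(num_samples):
        acc += sample_block_mask(n, slope, rng)
    return acc / num_samples


if __name__ == "__main__":
    n, slope = 16, 0.25
    est = empirical_mean_mask(n, slope, num_samples=4000)
    err = est - alibi_kernel(n, slope)
    print("max-norm error:", np.max(np.abs(err)))
    print("spectral-norm error:", np.linalg.norm(err, 2))
```

Each sampled mask restricts attention to contiguous blocks, so attention under a single mask decomposes into independent unbiased attention calls on short segments. Averaging more masks tightens both error norms in this sketch; the specific rates and block-size bounds are those stated in the paper and are not reproduced here.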