In recommender systems, generative retrieval typically uses an encoder-decoder setup: an encoder processes the user's interaction history, and an autoregressive decoder then generates recommended items. In large-scale streaming services, active users accumulate very long histories over time. As histories grow, the encoder becomes a major latency bottleneck because the cost of softmax attention scales quadratically with sequence length. In our experiments, bidirectional attention in the encoder substantially improves quality, yet most sub-quadratic attention methods target causal attention. We propose Gated Bidirectional Linear Attention (GBLA), a linear-time bidirectional attention layer that extends kernelized linear attention with three lightweight components: local causal mixing (Conv1D), sequence-level key gating for soft forgetting, and a gated RMSNorm output. On a large-scale Yandex Music dataset, a hybrid encoder that interleaves self-attention (SA) and GBLA in a 1:2 ratio (one SA block followed by two GBLA blocks) matches the quality of bidirectional self-attention. On H100 GPUs, GBLA reaches up to an $8.2\times$ single-layer speedup over FlashAttention-v3 at a history length of 32768. Finally, we show that the same hybrid design generalizes beyond our proprietary setting, consistently preserving self-attention retrieval quality on public Amazon benchmarks.
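To illustrate why a bidirectional kernelized layer can run in linear time, the following is a minimal NumPy sketch of plain non-causal kernelized linear attention, the base mechanism that GBLA is described as extending. It is not the GBLA layer itself: the Conv1D mixing, key gating, and gated RMSNorm output are omitted because their exact form is not given here. The `elu(x) + 1` feature map, the function names, and the tensor sizes are assumptions chosen for illustration.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1, a positive feature map. This choice is an assumption;
    # the kernel actually used by GBLA is not specified in the text above.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def bidirectional_linear_attention(q, k, v, eps=1e-6):
    # q, k: (n, d); v: (n, d_v). Every position attends to all positions.
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v           # (d, d_v) summary of all keys and values
    k_sum = phi_k.sum(axis=0)  # (d,) normalizer shared by all queries
    num = phi_q @ kv           # (n, d_v), cost linear in n
    den = phi_q @ k_sum + eps  # (n,)
    return num / den[:, None]


def quadratic_reference(q, k, v, eps=1e-6):
    # Same computation with the explicit (n, n) score matrix, for checking only.
    scores = feature_map(q) @ feature_map(k).T
    return (scores @ v) / (scores.sum(axis=1, keepdims=True) + eps)


rng = np.random.default_rng(0)
n, d = 128, 16  # hypothetical history length and head dimension
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

out_linear = bidirectional_linear_attention(q, k, v)
out_quadratic = quadratic_reference(q, k, v)
max_abs_diff = float(np.abs(out_linear - out_quadratic).max())
```

Because the key-value summary `kv` and the normalizer `k_sum` are computed once and reused by every query, the cost grows linearly with the history length `n`, whereas the reference version materializes an `n × n` score matrix. The two outputs should agree up to floating-point error.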