In recommender systems, generative retrieval typically uses an encoder-decoder setup: an encoder processes a user's interaction history, and an autoregressive decoder then generates recommended items. In large-scale streaming services, active users accumulate very long histories over time. As histories grow, the encoder becomes a major latency bottleneck because softmax attention scales quadratically with sequence length. In our experiments, using bidirectional attention in the encoder substantially improves quality. However, most sub-quadratic attention methods focus on causal attention. We propose Gated Bidirectional Linear Attention (GBLA), a linear-time bidirectional attention layer that extends kernelized linear attention with three lightweight components: local causal mixing (Conv1D), sequence-level key gating for soft forgetting, and a gated RMSNorm output. On a large-scale Yandex Music dataset, a hybrid encoder that interleaves self-attention (SA) and GBLA in a 1:2 ratio (one SA block followed by two GBLA blocks) matches the quality of bidirectional self-attention. On H100 GPUs, GBLA reaches up to an $8.2\times$ single-layer speedup over FlashAttention-v3 at a history length of 32768. Finally, we show that the same hybrid design generalizes beyond our proprietary setting, consistently preserving self-attention retrieval quality on public Amazon benchmarks.
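To make the linear-time claim concrete, the following is a minimal sketch of plain bidirectional (non-causal) kernelized linear attention, the base mechanism that GBLA is described as extending. It is not the GBLA layer itself: the Conv1D mixing, key gating, and gated RMSNorm output are omitted because the abstract does not specify them. The `elu(x) + 1` feature map and all function names are assumptions chosen for illustration.

```python
import math


def feature_map(x):
    """elu(x) + 1: a positive feature map (an assumed choice, not from the paper)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention_bidirectional(Q, K, V):
    """Non-causal kernelized linear attention, linear in sequence length n.

    Q, K: lists of n vectors of size d; V: list of n vectors of size d_v.
    """
    d, d_v = len(K[0]), len(V[0])
    # Global summaries shared by every query position (no causal mask).
    S = [[0.0] * d_v for _ in range(d)]  # sum_j phi(k_j) v_j^T
    z = [0.0] * d                        # sum_j phi(k_j)
    for k, v in zip(K, V):
        fk = feature_map(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = feature_map(q)
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / denom
                    for b in range(d_v)])
    return out


def quadratic_reference(Q, K, V):
    """Same result via explicit n x n attention weights; for checking only."""
    out = []
    for q in Q:
        fq = feature_map(q)
        w = [sum(a * b for a, b in zip(fq, feature_map(k))) for k in K]
        total = sum(w)
        out.append([sum(wj * v[b] for wj, v in zip(w, V)) / total
                    for b in range(len(V[0]))])
    return out
```

Because every query reads the same two summaries `S` and `z`, the cost grows linearly with history length, whereas `quadratic_reference` builds all pairwise weights. The two functions agree up to floating-point error, which is the sense in which the kernelized form is an exact rewrite of its own quadratic counterpart (though not of softmax attention).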