Modeling ultra-long user behavior sequences is pivotal for capturing evolving and lifelong interests in modern recommendation systems. However, deploying such models in real-time industrial environments faces a strict "Latency Wall", imposed by two distinct bottlenecks: the high I/O latency of retrieving massive user histories and the quadratic computational complexity of standard attention mechanisms. To break these bottlenecks, we present LASER, a full-stack optimization framework developed and deployed at Xiaohongshu (RedNote). Our approach tackles the challenges through two complementary innovations: (1) System efficiency: We introduce SeqVault, a unified schema-aware serving infrastructure for long user histories. By implementing a hybrid DRAM-SSD indexing strategy, SeqVault reduces retrieval latency by 50% and CPU usage by 75%, ensuring millisecond-level access to full real-time and life-cycle user histories. (2) Algorithmic efficiency: We propose a Segmented Target Attention (STA) mechanism to address the computational overhead. Motivated by the inherent sparsity of user interests, STA employs a sigmoid-based gating strategy that acts as a silencing mechanism to filter out noisy items. Subsequently, a lightweight Global Stacked Target Attention (GSTA) module refines these compressed segments to capture cross-segment dependencies without incurring high computational costs. This design performs effective sequence compression, reducing the complexity of long-sequence modeling while preserving critical signals. Extensive offline evaluations demonstrate that LASER consistently outperforms state-of-the-art baselines. In large-scale online A/B testing serving over 100 million daily active users, LASER achieved a 2.36% lift in ADVV and a 2.08% lift in revenue, confirming its scalability and significant commercial impact.
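The core idea behind STA can be illustrated with a minimal sketch. The sketch below is an illustrative assumption, not the paper's actual implementation: all function names, the mean-pooling compression, the gate threshold, and the single-layer attention standing in for GSTA are hypothetical choices, included only to show how sigmoid gating can silence noisy items and how attending over compressed segments shrinks the attention cost from the full sequence length to the number of segments.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def segmented_target_attention(history, target, seg_len=4, gate_thresh=0.5):
    """Hypothetical sketch of sigmoid-gated segment compression.

    history: (L, d) item embeddings of the user's behavior sequence.
    target:  (d,)   embedding of the candidate item.
    Returns a single (d,) user-interest vector w.r.t. the target.
    """
    d = history.shape[-1]
    # Sigmoid gate scores each history item's relevance to the target;
    # items below the threshold are silenced (zeroed out) as noise.
    gate = 1.0 / (1.0 + np.exp(-(history @ target) / np.sqrt(d)))
    gated = history * np.where(gate > gate_thresh, gate, 0.0)[:, None]
    # Compress each segment into one vector (mean pooling here, an assumption).
    n_seg = int(np.ceil(len(history) / seg_len))
    segments = np.stack([gated[i * seg_len:(i + 1) * seg_len].mean(axis=0)
                         for i in range(n_seg)])
    # Lightweight target attention over the n_seg compressed segments,
    # instead of over all L raw items.
    attn = softmax(segments @ target / np.sqrt(d))
    return attn @ segments

# Toy usage: a 16-item history compressed into 4 segments.
rng = np.random.default_rng(0)
out = segmented_target_attention(rng.standard_normal((16, 8)),
                                 rng.standard_normal(8))
```

Under these assumptions, per-target attention cost drops from O(L) keys to O(L / seg_len) keys, which is the source of the complexity reduction the abstract describes.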