Action-Aware Generative Sequence Modeling for Short Video Recommendation

With the rapid development of the Internet, users have increasingly high expectations for the recommendation accuracy of online content consumption platforms. However, short videos often contain diverse segments, and users may not hold the same attitude toward all of them. Traditional binary-classification recommendation models, which treat a video as a single holistic entity, are limited in how accurately they can capture such nuanced preferences. Considering that user consumption is a temporal process, this paper demonstrates, through statistical analysis and examination of action patterns, that the timing of user actions can represent diverse intentions. Based on this insight, we propose a novel modeling paradigm, the Action-Aware Generative Sequence Network (A2Gen), which refines user actions along the temporal dimension and chains them into sequences for unified processing and prediction. First, we introduce the Context-aware Attention Module (CAM) to model action sequences enriched with item-specific contextual features. Building upon this, we develop the Hierarchical Sequence Encoder (HSE) to learn temporal action patterns from users' historical actions. Finally, leveraging CAM, we design a module for action sequence generation, the Action-seq Autoregressive Generator (AAG). Extensive offline experiments on Kuaishou's dataset and the public Tmall dataset demonstrate the superiority of our proposed model. Furthermore, in large-scale online A/B testing on Kuaishou's platform, our model achieves significant improvements over baseline methods in multi-task prediction by leveraging sequential information. Specifically, it yields increases of 0.34% in user watch time, 8.1% in interaction rate, and 0.162% in overall user retention (LifeTime-7), which led to successful deployment across all traffic, serving over 400 million users every day.
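
To make the idea of refining actions along the temporal dimension and chaining them into a sequence more concrete, the following is a minimal sketch. It is an illustration only, not the A2Gen implementation: the abstract does not specify how actions are tokenized, so the `Action` type, the `to_action_sequence` helper, the action names, and the fixed number of time buckets are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical record of one user action on a single video."""
    kind: str      # assumed action name, e.g. "like", "comment", "exit"
    time_s: float  # seconds into the video at which the action occurred


def to_action_sequence(actions, video_len_s, num_buckets=4):
    """Chain time-stamped actions into an ordered list of (bucket, kind) tokens.

    The bucketing scheme (equal-width segments of the video) is an assumption
    chosen for illustration; the paper may discretize time differently.
    """
    if video_len_s <= 0:
        raise ValueError("video_len_s must be positive")
    tokens = []
    # Sort by time so the sequence reflects the order of consumption.
    for action in sorted(actions, key=lambda a: a.time_s):
        # Clamp the relative position to [0, 1] before bucketing.
        frac = min(max(action.time_s / video_len_s, 0.0), 1.0)
        bucket = min(int(frac * num_buckets), num_buckets - 1)
        tokens.append((bucket, action.kind))
    return tokens


actions = [Action("exit", 18.0), Action("like", 3.0), Action("comment", 11.0)]
print(to_action_sequence(actions, video_len_s=20.0))
# [(0, 'like'), (2, 'comment'), (3, 'exit')]
```

In this toy representation, a like in the first quarter of a video and an exit in the last quarter become distinct tokens, which is the kind of timing signal a single per-video binary label cannot express.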
