Multi-Task Fusion plays a pivotal role in industrial short-video search systems by aggregating heterogeneous prediction signals into a unified ranking score. However, existing approaches predominantly optimize for immediate engagement metrics, which often fail to align with long-term user satisfaction. While Reinforcement Learning (RL) offers a promising avenue for optimizing user satisfaction, applying it directly to search is non-trivial: compared with recommendation feeds, search data is inherently sparse and ranking is constrained by query intent. To this end, we propose SaFRO, a novel framework designed to optimize user satisfaction in short-video search. We first construct a satisfaction-aware reward model that uses query-level behavioral proxies to capture holistic user satisfaction beyond item-level interactions. We then introduce Dual-Relative Policy Optimization (DRPO), an efficient policy learning method that updates the fusion policy through relative preference comparisons within groups and across batches. Furthermore, we design a Task-Relation-Aware Fusion module that explicitly models the interdependencies among different objectives, enabling context-sensitive weight adaptation. Extensive offline evaluations and large-scale online A/B tests on the Kuaishou short-video search platform demonstrate that SaFRO significantly outperforms state-of-the-art baselines, delivering substantial gains in both short-term ranking quality and long-term user retention.
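To make the two central ideas more concrete, the following is a minimal illustrative sketch, not the SaFRO implementation. It assumes that fusion is a simple weighted sum of per-task predictions and that the "dual-relative" signal can be approximated by adding a within-group standardized reward to a cross-batch standardized group mean. The function names `fuse_scores` and `dual_relative_advantages`, the additive combination, and the `eps` smoothing term are all hypothetical choices made for illustration; the abstract does not specify the actual formulation.

```python
from statistics import mean, pstdev


def fuse_scores(predictions, weights):
    """Fuse per-task predictions into one ranking score.

    Assumption: a plain weighted sum; the actual fusion function may differ.
    """
    return sum(weights[task] * predictions[task] for task in weights)


def dual_relative_advantages(group_rewards, eps=1e-8):
    """Compute a hypothetical dual-relative advantage for each sample.

    group_rewards: list of groups, where each group holds the rewards of
    several candidate fusion policies sampled for the same query.

    Each advantage is the sum of two relative terms (an assumed combination):
      1. within-group: the reward standardized against its own group;
      2. across-batch: the group's mean reward standardized against the
         mean rewards of all groups in the batch.
    """
    group_means = [mean(group) for group in group_rewards]
    batch_mean = mean(group_means)
    batch_std = pstdev(group_means)

    advantages = []
    for group, group_mean in zip(group_rewards, group_means):
        group_std = pstdev(group)
        cross_batch = (group_mean - batch_mean) / (batch_std + eps)
        advantages.append(
            [(reward - group_mean) / (group_std + eps) + cross_batch
             for reward in group]
        )
    return advantages


if __name__ == "__main__":
    # Hypothetical task names and values, for illustration only.
    score = fuse_scores({"click": 0.5, "like": 0.2}, {"click": 2.0, "like": 1.0})
    print(score)
    print(dual_relative_advantages([[1.0, 3.0], [5.0, 7.0]]))
```

In this sketch, a sample is favored when it beats its siblings for the same query and when its query group as a whole scores above the batch average. Because only relative comparisons are used, the sketch needs no learned value function.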