Traffic allocation is the process of redistributing organic traffic to products by adjusting their positions in the post-search phase, aiming to effectively foster merchant growth, precisely meet customer demands, and maximize the interests of all parties on e-commerce platforms. Existing methods based on learning to rank neglect the long-term value of traffic allocation, whereas reinforcement-learning approaches struggle to balance multiple objectives and suffer from cold starts in real-world data environments. To address these issues, this paper proposes MODRL-TA, a multi-objective deep reinforcement learning framework consisting of multi-objective Q-learning (MOQ), a decision fusion algorithm (DFM) based on the cross-entropy method (CEM), and a progressive data augmentation system (PDA). Specifically, MOQ constructs an ensemble of RL models, each dedicated to one objective, such as click-through rate or conversion rate. Each model individually determines item positions as actions, estimating the long-term value of its objective from an individual perspective. We then employ DFM to dynamically adjust the weights among objectives to maximize long-term value, addressing the temporal dynamics of objective preferences in e-commerce scenarios. PDA initially trains MOQ on simulated data derived from offline logs; as experiments progress, it strategically integrates real user-interaction data, ultimately replacing the simulated dataset to alleviate distributional shift and the cold-start problem. Experimental results on a real-world online e-commerce system demonstrate the significant improvements of MODRL-TA, and we have successfully deployed MODRL-TA on an e-commerce search platform.
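The CEM-based weight search that DFM performs can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function name `cem_weight_search`, the Gaussian sampling distribution, and the `evaluate` callback, which stands in for whatever fused long-term-value estimate the per-objective Q-models provide for a candidate weight vector.

```python
import numpy as np

def cem_weight_search(evaluate, n_objectives, iters=20, pop=100, elite_frac=0.1, seed=0):
    """Cross-entropy method sketch: search for objective weights w on the
    probability simplex that maximize a scalar score from `evaluate(w)`
    (hypothetically, a fused long-term-value estimate over objectives)."""
    rng = np.random.default_rng(seed)
    mu = np.full(n_objectives, 1.0 / n_objectives)  # mean of sampling distribution
    sigma = np.full(n_objectives, 0.5)              # per-dimension std deviation
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, n_objectives))
        samples = np.abs(samples)
        samples /= samples.sum(axis=1, keepdims=True)   # project onto the simplex
        scores = np.array([evaluate(w) for w in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]  # keep top-scoring weights
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6  # refit distribution
    return mu / mu.sum()
```

The refit-on-elites loop is what lets the weights track shifting objective preferences: rerunning the search with a fresh `evaluate` adapts the weighting without retraining the underlying Q-models.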
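The progressive replacement of simulated data with real interactions in PDA can be sketched as a batch sampler. The linear mixing schedule, the function name `progressive_batch`, and the list-based datasets are all assumptions for illustration, not the paper's exact mechanism.

```python
import random

def progressive_batch(sim_data, real_data, step, total_steps, batch_size=32, rng=None):
    """Progressive data augmentation sketch: early batches are drawn mostly
    from simulated offline-log data; the share of real user interactions
    grows (here, linearly) until it fully replaces the simulated dataset,
    easing distributional shift and the cold-start problem."""
    rng = rng or random.Random(0)
    real_frac = min(1.0, step / total_steps)  # ramps from 0 to 1 over training
    n_real = round(batch_size * real_frac)
    batch = rng.sample(real_data, min(n_real, len(real_data)))
    batch += rng.sample(sim_data, batch_size - len(batch))  # fill rest with sim
    return batch
```

At `step = 0` every sample comes from the simulation; by `step = total_steps` the batch is entirely real interaction data, so the RL models never face a hard distribution switch.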