Modern recommendation systems stand to benefit from probing for and learning from delayed feedback. Prior research has tended to focus on learning from a user's response to a single recommendation. Such work, which leverages supervised and bandit learning methods, forgoes learning from the user's subsequent behavior. Where past work has aimed to learn from subsequent behavior, it has lacked effective methods for probing to elicit informative delayed feedback. Effective exploration through probing for delayed feedback becomes particularly challenging when rewards are sparse. To address this, we develop deep exploration methods for recommendation systems. In particular, we formulate recommendation as a sequential decision problem and demonstrate the benefits of deep exploration over single-step exploration. Our experiments are carried out with high-fidelity industrial-grade simulators and establish large improvements over existing algorithms.