Modern recommendation systems ought to benefit by probing for and learning from delayed feedback. Research has tended to focus on learning from a user's response to a single recommendation. Such work, which leverages methods of supervised and bandit learning, forgoes learning from the user's subsequent behavior. Where past work has aimed to learn from subsequent behavior, there has been a lack of effective methods for probing to elicit informative delayed feedback. Effective exploration through probing for delayed feedback becomes particularly challenging when rewards are sparse. To address this, we develop deep exploration methods for recommendation systems. In particular, we formulate recommendation as a sequential decision problem and demonstrate benefits of deep exploration over single-step exploration. Our experiments are carried out with high-fidelity industrial-grade simulators and establish large improvements over existing algorithms.