Modern recommendation systems ought to benefit from probing for and learning from delayed feedback. Research has tended to focus on learning from a user's response to a single recommendation. Such work, which leverages methods of supervised and bandit learning, forgoes learning from the user's subsequent behavior. Where past work has aimed to learn from subsequent behavior, it has lacked effective methods for probing to elicit informative delayed feedback. Effective exploration through probing for delayed feedback becomes particularly challenging when rewards are sparse. To address this, we develop deep exploration methods for recommendation systems. In particular, we formulate recommendation as a sequential decision problem and demonstrate the benefits of deep exploration over single-step exploration. Our experiments are carried out with high-fidelity industrial-grade simulators and establish large improvements over existing algorithms.