The bandit paradigm provides a unified modeling framework for problems that require decision-making under uncertainty. Because many business metrics can be viewed as rewards (a.k.a. utilities) that result from actions, bandit algorithms have seen large and growing interest from industrial applications such as search, recommendation and advertising. Indeed, the bandit lens comes with the promise of directly optimising the metrics we care about. Nevertheless, the road to successfully applying bandits in production is not an easy one. Even when the action space and rewards are well-defined, practitioners still need to choose between multi-armed and contextual approaches, on- and off-policy setups, delayed and immediate feedback, myopic and long-term optimisation, etc. To make matters worse, industrial platforms typically give rise to large action spaces in which existing approaches tend to break down. The research literature on these topics is vast, and it can overwhelm practitioners, whose primary aim is to solve practical problems and who therefore need to decide on a specific instantiation or approach for each project. This tutorial takes a step towards filling that gap between the theory and practice of bandits. Our goal is to present a unified overview of the field and its existing terminology, concepts and algorithms, with a focus on problems relevant to industry. We hope our industrial perspective will help future practitioners who wish to leverage the bandit paradigm for their applications.