Optimal Learning for Structured Bandits

We study structured multi-armed bandits, the problem of online decision-making under uncertainty in the presence of structural information. In this problem, the decision-maker needs to discover the best course of action despite observing only uncertain rewards over time. The decision-maker is aware of certain convex structural information regarding the reward distributions; that is, the decision-maker knows that the reward distributions of the arms belong to a convex compact set. In the presence of such structural information, the decision-maker would like to minimize regret by exploiting this information, where the regret is the performance difference against a benchmark policy that knows the best action ahead of time. In the absence of structural information, the classical upper confidence bound (UCB) and Thompson sampling algorithms are well known to suffer minimal regret. As recently pointed out, however, neither algorithm is capable of exploiting structural information that is commonly available in practice. We propose a novel learning algorithm that we call "DUSA" whose regret matches the information-theoretic regret lower bound up to a constant factor and that can handle a wide range of structural information. DUSA solves a dual counterpart of the regret lower bound at the empirical reward distribution and follows its suggested play. We show that this idea leads to the first computationally viable learning policy with asymptotically minimal regret for various structural information, including well-known structured bandits such as linear, Lipschitz, and convex bandits, and novel structured bandits that have not been studied in the literature due to the lack of a unified and flexible framework.
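To make the notion of regret concrete, the following is a minimal sketch of the classical unstructured baseline mentioned above, not of DUSA. It runs the standard UCB1 index rule on Bernoulli arms and accumulates the pseudo-regret, that is, the gap between the best arm's mean and the mean of the arm actually played. The function name `ucb1_pseudo_regret`, the Bernoulli reward model, and the example means are illustrative assumptions, not taken from the paper.

```python
import math
import random


def ucb1_pseudo_regret(means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms (assumed reward model).

    Returns the cumulative pseudo-regret and the pull count per arm.
    """
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms   # number of pulls per arm
    sums = [0.0] * n_arms   # total observed reward per arm
    best_mean = max(means)  # known only to the benchmark policy
    regret = 0.0

    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Pull every arm once so each empirical mean is defined.
            arm = t - 1
        else:
            # UCB1 index: empirical mean plus a confidence radius.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        # Pseudo-regret: expected loss relative to always playing the best arm.
        regret += best_mean - means[arm]

    return regret, counts


if __name__ == "__main__":
    regret, counts = ucb1_pseudo_regret([0.2, 0.5, 0.8], horizon=5000)
    print("pseudo-regret:", regret)
    print("pulls per arm:", counts)
```

This baseline treats every arm independently: an observation of one arm never tightens the estimate of another. That independence is exactly what a structured policy is meant to avoid when the reward distributions are known to lie in a common convex set.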
