We study the greedy (exploitation-only) algorithm in bandit problems with a known reward structure. Whereas prior work focused on a few specific reward structures, we allow arbitrary finite ones. We fully characterize when the greedy algorithm asymptotically succeeds or fails, in the sense of sublinear vs. linear regret as a function of time. Our characterization identifies a partial identifiability property of the problem instance as the necessary and sufficient condition for asymptotic success. Notably, once this property holds, the problem becomes easy: any algorithm succeeds (in the same sense as above), provided it satisfies a mild non-degeneracy condition. We further extend our characterization to contextual bandits and interactive decision-making with arbitrary feedback, and demonstrate its broad applicability across various examples.
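To make the setting concrete, the following is a minimal Python sketch of exploitation-only play over a finite reward structure. It is an illustration under stated assumptions, not the formal model or algorithm of this work: the reward structure is taken to be a finite list of candidate mean-reward vectors, the fit criterion is squared error (the Gaussian maximum-likelihood choice), ties go to the lowest index, and there is no warm-up data. The function name `greedy_regret` and both toy instances are hypothetical.

```python
import random


def greedy_regret(models, true_idx, horizon, noise=0.0, seed=0):
    """Run exploitation-only play and return cumulative regret (in means).

    Assumptions (illustrative only): `models` is the known finite reward
    structure, given as candidate mean-reward vectors; rewards are the true
    mean plus Gaussian noise with standard deviation `noise`.
    """
    rng = random.Random(seed)
    truth = models[true_idx]
    n_arms = len(truth)
    history = []  # (arm, observed reward) pairs
    regret = 0.0
    for _ in range(horizon):
        # Squared-error fit of every candidate model to the data so far.
        losses = [sum((r - m[a]) ** 2 for a, r in history) for m in models]
        best_fit = models[losses.index(min(losses))]  # ties: lowest index
        # Pure exploitation: play the arm that is best under the fitted model.
        arm = max(range(n_arms), key=lambda a: best_fit[a])
        reward = truth[arm] + rng.gauss(0.0, noise)
        history.append((arm, reward))
        regret += max(truth) - truth[arm]
    return regret


# Toy instance 1: both candidates agree on arm 0, so playing arm 0 never
# separates them, and arm 0 is suboptimal under the true model (index 1).
stuck = [(0.5, 0.0), (0.5, 1.0)]

# Toy instance 2: the candidates disagree on arm 0, so one pull of arm 0
# already favors the true model (index 1).
separable = [(0.5, 0.0), (0.2, 1.0)]

if __name__ == "__main__":
    for name, models in (("stuck", stuck), ("separable", separable)):
        print(name, greedy_regret(models, true_idx=1, horizon=200))
```

With noise-free rewards, the first instance accrues regret 0.5 in every round (linear growth), because data from arm 0 fits both candidates equally well and the tie-break never changes. In the second instance, a single pull of arm 0 rules out the wrong candidate and regret stops growing afterwards. This contrast is only loosely in the spirit of the identifiability condition described above; the precise property is the one defined in the work itself.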