The Implicitly Normalized Forecaster (INF), which is online mirror descent with Tsallis entropy as the prox-function, is known to be an optimal algorithm for adversarial multi-armed bandit (MAB) problems. However, most of the complexity results rely on bounded rewards or other restrictive assumptions. Recently, a closely related best-of-both-worlds algorithm was proposed for both adversarial and stochastic heavy-tailed MAB settings. This algorithm is known to be optimal in both settings, but it does not fully exploit the data. In this paper, we propose the Implicitly Normalized Forecaster with clipping for MAB problems with heavy-tailed reward distributions. We derive convergence results under mild assumptions on the reward distribution and show that the proposed method is optimal for both linear and non-linear heavy-tailed stochastic MAB problems. We also show that the algorithm usually performs better than the best-of-both-worlds algorithm.
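To make the idea of combining the Tsallis-entropy forecaster with clipping concrete, the following is a minimal sketch and not the exact method analyzed in the paper. It assumes the 1/2-Tsallis entropy, a constant learning rate, a follow-the-regularized-leader form of the update, and symmetric clipping of the observed loss before importance weighting. The names `tsallis_weights`, `inf_clip`, `pull`, `eta`, and `clip_level` are hypothetical and chosen only for illustration.

```python
import numpy as np


def tsallis_weights(cum_loss, eta, iters=60):
    """Arm probabilities for the 1/2-Tsallis regularizer (illustrative).

    Solves w_i = 4 / (eta * (L_i - x))^2 with sum(w) = 1, where the
    normalizing constant x < min(L) is found by bisection.
    """
    k = len(cum_loss)
    low = cum_loss.min() - 2.0 * np.sqrt(k) / eta  # here sum(w) <= 1
    high = cum_loss.min() - 2.0 / eta              # here sum(w) >= 1
    for _ in range(iters):
        x = 0.5 * (low + high)
        total = np.sum(4.0 / (eta * (cum_loss - x)) ** 2)
        if total > 1.0:
            high = x  # weights too large: move x away from min(L)
        else:
            low = x
    w = 4.0 / (eta * (cum_loss - 0.5 * (low + high))) ** 2
    return w / w.sum()  # remove the residual bisection error


def inf_clip(pull, n_arms, horizon, eta, clip_level, seed=0):
    """Run the clipped forecaster sketch; return how often each arm was played.

    `pull(arm)` is an assumed user-supplied callback returning the observed
    loss of the chosen arm; it may be heavy-tailed.
    """
    rng = np.random.default_rng(seed)
    cum_loss = np.zeros(n_arms)
    counts = np.zeros(n_arms, dtype=int)
    for _ in range(horizon):
        w = tsallis_weights(cum_loss, eta)
        arm = rng.choice(n_arms, p=w)
        loss = pull(arm)
        # Clip before importance weighting so that a single extreme
        # observation cannot dominate the cumulative loss estimate.
        clipped = np.clip(loss, -clip_level, clip_level)
        cum_loss[arm] += clipped / w[arm]
        counts[arm] += 1
    return counts
```

A heavy-tailed environment can be simulated by letting `pull` return an arm-dependent mean plus, for example, Student-t noise with a small number of degrees of freedom. The values of `eta` and `clip_level` in this sketch are free parameters; the choices that give the optimality guarantees are specified in the paper, not here.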