Bandit algorithms to emulate human decision making using probabilistic distortions

Motivated by models of human decision making that were proposed to explain commonly observed deviations from conventional expected-value preferences, we formulate two stochastic multi-armed bandit problems with distorted probabilities on the reward distributions: the classic $K$-armed bandit and the linearly parameterized bandit. We study both problems in the regret minimization and the best arm identification frameworks. For regret minimization in the $K$-armed and linear bandit problems, we propose algorithms that are inspired by Upper Confidence Bound (UCB) algorithms, incorporate reward distortions, and exhibit sublinear regret. For the $K$-armed bandit setting, we derive an upper bound on the expected regret of our proposed algorithm and prove a matching lower bound, which establishes the order-optimality of our algorithm. For the linearly parameterized setting, our algorithm achieves a regret upper bound of the same order as that of the standard linear bandit algorithm, Optimism in the Face of Uncertainty Linear (OFUL); unlike OFUL, our algorithm handles distortions and an arm-dependent noise model. For best arm identification in the $K$-armed bandit setting, we propose algorithms, derive guarantees on their performance, and show that these algorithms are order-optimal by proving matching fundamental limits on performance. For best arm identification in linear bandits, we propose an algorithm and establish sample complexity guarantees. Finally, we present simulation experiments that demonstrate the advantages of distortion-aware learning algorithms in a vehicular traffic routing application.
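To make the notion of a distorted probability concrete, the following is a minimal sketch of how a distorted value could be estimated from reward samples of a single arm. The abstract does not specify a weight function or an estimator, so everything here is an assumption chosen for illustration: the inverse-S-shaped weight function of Tversky and Kahneman with parameter `eta = 0.61`, nonnegative rewards, an order-statistics estimator of $\int_0^\infty w(P(X > z))\,dz$, and the hypothetical names `tk_weight` and `distorted_value`.

```python
import numpy as np


def tk_weight(p, eta=0.61):
    """Tversky-Kahneman probability weight (an assumed choice of distortion).

    Maps a probability p in [0, 1] to a distorted probability; it overweights
    small probabilities and underweights large ones, with w(0)=0 and w(1)=1.
    """
    p = np.asarray(p, dtype=float)
    return p**eta / (p**eta + (1.0 - p) ** eta) ** (1.0 / eta)


def distorted_value(samples, weight=tk_weight):
    """Estimate the distorted value of a nonnegative reward from i.i.d. samples.

    With order statistics x_(1) <= ... <= x_(n), the empirical tail
    probability P(X >= x_(i)) is (n - i + 1) / n, so the estimate is
        sum_i x_(i) * [w((n - i + 1) / n) - w((n - i) / n)].
    With the identity weight this reduces to the sample mean.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)
    coeffs = weight((n - i + 1) / n) - weight((n - i) / n)
    return float(np.dot(x, coeffs))


if __name__ == "__main__":
    # A rare large reward: 10 with empirical probability 1/4, otherwise 0.
    rewards = [0.0, 0.0, 0.0, 10.0]
    print("sample mean:    ", float(np.mean(rewards)))
    print("distorted value:", distorted_value(rewards))
```

Under this assumed weight function, the rare large reward is overweighted, so the distorted value of the sample exceeds its mean. A distortion-aware UCB-style algorithm would rank arms by such an estimate plus a confidence bonus instead of by the sample mean; the form of that bonus is not given in the abstract and is not sketched here.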
