Multi-armed bandit (MAB) algorithms have achieved significant success in sequential decision-making applications, under the premise that humans perfectly implement the recommended policy. However, existing methods often overlook the crucial factor of human trust in learning algorithms. When trust is lacking, humans may deviate from the recommended policy, leading to degraded learning performance. Motivated by this gap, we study the trust-aware MAB problem by integrating a dynamic trust model into the standard MAB framework. Specifically, our model assumes that the recommended and the actually implemented policies may differ depending on human trust, which in turn evolves with the quality of the recommended policy. We establish the minimax regret in the presence of the trust issue and demonstrate the suboptimality of vanilla MAB algorithms such as the upper confidence bound (UCB) algorithm. To overcome this limitation, we introduce a novel two-stage trust-aware procedure that provably attains near-optimal statistical guarantees. A simulation study illustrates the benefits of the proposed algorithm when dealing with the trust issue.