This paper addresses the exploration-exploitation dilemma inherent in decision-making, focusing on multi-armed bandit problems. In these problems, an agent must decide whether to exploit its current knowledge for immediate gain or explore other options for potential long-term reward. We introduce a novel algorithm, approximate information maximization (AIM), which uses an analytical approximation of the entropy gradient to choose which arm to pull at each time step. AIM matches the performance of Infomax and Thompson sampling while offering greater computational speed, determinism, and tractability. Empirical evaluation indicates that AIM complies with the Lai-Robbins asymptotic bound and is robust across a range of priors. Its expression is tunable, which allows it to be optimized for specific settings.
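For readers unfamiliar with the baselines named above, the following is a minimal sketch of Beta-Bernoulli Thompson sampling together with the constant that multiplies ln T in the Lai-Robbins bound for Bernoulli arms. It is not an implementation of AIM, whose entropy-gradient expression is not given in this abstract. The function names, the uniform Beta(1, 1) prior, and the example arm means are assumptions chosen for illustration.

```python
import math
import random


def thompson_sampling_bernoulli(means, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling (baseline sketch, not AIM).

    Returns the per-arm pull counts and the cumulative pseudo-regret.
    Assumes a uniform Beta(1, 1) prior on every arm.
    """
    rng = random.Random(seed)
    n_arms = len(means)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    best = max(means)
    regret = 0.0
    for _ in range(horizon):
        # Draw one sample from each arm's Beta posterior.
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(n_arms)
        ]
        # Pull the arm whose posterior sample is largest.
        arm = max(range(n_arms), key=samples.__getitem__)
        reward = 1 if rng.random() < means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
        # Pseudo-regret: expected loss relative to always pulling the best arm.
        regret += best - means[arm]
    return pulls, regret


def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def lai_robbins_slope(means):
    """Coefficient of ln(T) in the Lai-Robbins lower bound for Bernoulli arms."""
    best = max(means)
    return sum((best - m) / bernoulli_kl(m, best) for m in means if m < best)


if __name__ == "__main__":
    arm_means = [0.2, 0.8]  # hypothetical arm means
    horizon = 2000
    pulls, regret = thompson_sampling_bernoulli(arm_means, horizon)
    print("pulls per arm:", pulls)
    print("cumulative pseudo-regret:", regret)
    print("Lai-Robbins slope * ln(T):", lai_robbins_slope(arm_means) * math.log(horizon))
```

The Lai-Robbins bound is asymptotic, so the last printed value is a reference scale for regret growth rather than a guarantee at any finite horizon. Note also that Thompson sampling depends on the random seed, whereas the abstract describes AIM as deterministic.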