We study the exploration incentive problem in both bandits and reinforcement learning when rewards are scale-free and potentially unbounded, a setting that is motivated by real-world scenarios and differs from existing work. Prior work in reinforcement learning either assumes costly interactions with an environment or proposes algorithms that may find low-quality local maxima. Motivated by EXP-type methods, which integrate multiple agents (experts) for exploration in bandits under the assumption that rewards are bounded, we propose two new algorithms, EXP4.P and EXP4-RL, for exploration in the unbounded reward case and demonstrate their effectiveness in these new settings. Unbounded rewards introduce challenges: the regret can no longer be bounded by the number of trials, and selecting suboptimal arms may lead to infinite regret. Specifically, we establish regret upper bounds for EXP4.P in both bounded and unbounded linear and stochastic contextual bandits. Surprisingly, we also find that by including one sufficiently competent expert, EXP4.P can achieve global optimality in the linear case. The unbounded reward result also applies to a revised version of EXP3.P in the multi-armed bandit setting. With EXP4-RL, we extend EXP4.P from bandits to reinforcement learning, incentivizing exploration through multiple agents, including one high-performing agent, for both efficiency and excellence. EXP4-RL has been tested on difficult-to-explore games and shows significant improvements in exploration over the state of the art.