Indexability of Finite State Restless Multi-Armed Bandit and Rollout Policy

We consider a finite-state restless multi-armed bandit problem. The decision maker can act on M out of N bandits at each time step. Playing an arm (the active action) yields a state-dependent reward, and an arm that is not played (the passive action) also yields a reward that depends on its state. The objective of the decision maker is to maximize the infinite-horizon discounted reward. The classical approach to restless bandits is the Whittle index policy, in which the M arms with the highest indices are played at each time step. To obtain the indices, one first analyzes a relaxed constrained version of the restless bandit problem and then, through Lagrangian relaxation, decouples it into N single-armed restless bandit problems. We analyze the single-armed restless bandit. To study the Whittle index policy, we derive structural results for the single-armed bandit model. We define indexability and prove it in special cases. We propose an alternative approach that verifies the indexability criterion for a single-armed bandit model using the value iteration algorithm, and we demonstrate the performance of this algorithm on several examples. We provide insight into the conditions for indexability of restless bandits under different structural assumptions on the transition probability and reward matrices. We also study an online rollout policy, discuss its computational complexity, and compare it with the complexity of index computation. Numerical examples illustrate that the index policy and the rollout policy perform better than the myopic policy.
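
As an illustration of the value-iteration check mentioned above, the following is a minimal sketch and not the paper's algorithm. It assumes the standard Whittle formulation in which a subsidy `lam` is added to the passive reward, and it tests on a finite grid of subsidies whether the set of states where the passive action is optimal only grows as `lam` increases. The function names, the tie-breaking rule (ties count as passive), and the parameters `beta`, `tol`, and `max_iter` are all assumptions made for this example.

```python
def passive_set(P0, P1, r0, r1, lam, beta=0.9, tol=1e-9, max_iter=10_000):
    """Value iteration for one arm with passive subsidy lam (assumed setup).

    P0, P1: transition matrices (lists of rows) for passive / active action.
    r0, r1: per-state rewards for passive / active action.
    Returns the set of states where the passive action is optimal.
    """
    n = len(r0)
    V = [0.0] * n

    def q_values(V):
        # Action values for passive (with subsidy) and active play.
        Q0 = [r0[s] + lam + beta * sum(P0[s][t] * V[t] for t in range(n))
              for s in range(n)]
        Q1 = [r1[s] + beta * sum(P1[s][t] * V[t] for t in range(n))
              for s in range(n)]
        return Q0, Q1

    for _ in range(max_iter):
        Q0, Q1 = q_values(V)
        V_new = [max(a, b) for a, b in zip(Q0, Q1)]
        diff = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if diff < tol:
            break

    Q0, Q1 = q_values(V)
    # Ties are assigned to the passive action (an assumption of this sketch).
    return frozenset(s for s in range(n) if Q0[s] >= Q1[s])


def is_indexable_on_grid(P0, P1, r0, r1, lambdas, beta=0.9):
    """Check that the passive set is nested (non-decreasing) over the grid.

    This is only a numerical check on the given grid, not a proof.
    """
    sets = [passive_set(P0, P1, r0, r1, lam, beta) for lam in sorted(lambdas)]
    return all(a <= b for a, b in zip(sets, sets[1:]))
```

A grid check of this kind can miss violations that occur between grid points, so a finer grid or a structural argument is needed before concluding that an arm is indexable.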
