We study the stochastic multi-armed bandit (MAB) problem in which an underlying network structure enables side-observations across related actions. A bipartite graph links actions to a set of unknowns, so that selecting an action reveals observations for every unknown it is connected to. Whereas prior work assumes that all actions are permanently accessible, we investigate the more practical setting of stochastic availability, in which the set of feasible actions (the "activation set") varies from round to round. This framework models real-world systems that exhibit both structural dependencies and volatility, such as social networks in which users provide side-information about their peers' preferences but are not always online to be queried. To address this setting, we propose UCB-LP-A, a novel policy that uses a linear programming (LP) approach to balance exploration and exploitation under stochastic availability. Unlike standard network bandit algorithms, which assume constant access, UCB-LP-A computes an optimal sampling distribution over the realizable activation sets, ensuring that the necessary observations are gathered using only the currently active arms. We derive a theoretical upper bound on the regret of our policy that characterizes the impact of both the network structure and the activation probabilities. Finally, numerical simulations show that UCB-LP-A significantly outperforms existing heuristics that ignore either the side-information or the availability constraints.
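To make the setting concrete, the following is a minimal Python sketch of the observation model only: a bipartite graph from actions to unknowns, a random activation set in each round, and side-observations for every unknown connected to the played action. It is not UCB-LP-A. The graph, the availability probabilities, the Bernoulli means, the assumption that actions become available independently, and the placeholder selection rule are all hypothetical choices made for illustration.

```python
import random

# Hypothetical bipartite graph: each action maps to the unknowns it reveals.
GRAPH = {
    "a0": ["u0", "u1"],
    "a1": ["u1", "u2"],
    "a2": ["u2"],
}
# Hypothetical per-round availability probability of each action
# (independence across actions is an assumption of this sketch).
AVAILABILITY = {"a0": 0.5, "a1": 0.8, "a2": 1.0}
# Hypothetical Bernoulli means of the unknowns, hidden from the learner.
MEANS = {"u0": 0.2, "u1": 0.6, "u2": 0.9}


def sample_activation_set(rng):
    """Draw the set of actions that are feasible in the current round."""
    return [a for a, p in AVAILABILITY.items() if rng.random() < p]


def play(action, rng):
    """Return one Bernoulli observation for every unknown linked to `action`."""
    return {u: int(rng.random() < MEANS[u]) for u in GRAPH[action]}


def simulate(rounds, seed=0):
    """Run the observation model and return per-unknown counts and sums."""
    rng = random.Random(seed)
    counts = {u: 0 for u in MEANS}
    sums = {u: 0 for u in MEANS}
    for _ in range(rounds):
        active = sample_activation_set(rng)
        if not active:
            continue  # no action is feasible in this round
        # Placeholder rule (NOT the paper's policy): among active actions,
        # play the one whose least-observed unknown has the fewest samples.
        action = min(active, key=lambda a: min(counts[u] for u in GRAPH[a]))
        for u, x in play(action, rng).items():
            counts[u] += 1
            sums[u] += x
    return counts, sums
```

Calling `simulate(1000)` returns the number of observations and the sum of observed values for each unknown, from which empirical means can be formed. A policy such as the one described above would replace the placeholder selection rule with its own choice among the currently active actions.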