We study the piecewise-stationary combinatorial semi-bandit problem with causally related rewards. In our nonstationary environment, changes in the base arms' distributions, in the causal relationships between rewards, or in both alter the reward generation process. In such an environment, an optimal decision-maker must track both sources of change and adapt accordingly. The problem is aggravated in the combinatorial semi-bandit setting, where the decision-maker observes only the outcomes of the selected bundle of arms. The core of our proposed policy is the Upper Confidence Bound (UCB) algorithm. To cope with nonstationarity, the agent takes an adaptive approach: it employs a change-point detector based on the Generalized Likelihood Ratio (GLR) test. In addition, we introduce group restart, a new restarting strategy for decision-making in structured environments. Finally, our algorithm integrates a mechanism to track variations in the underlying graph structure, which captures the causal relationships between the rewards in the bandit setting. Theoretically, we establish a regret upper bound that reflects the effect of the number of structural and distributional changes on performance. Numerical experiments in real-world scenarios demonstrate the applicability and superior performance of our proposal compared to state-of-the-art benchmarks.
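To make the change-point detection step concrete, the following is a minimal sketch of a GLR test on the reward stream of a single base arm. It is an illustration under stated assumptions, not the policy itself: it assumes Bernoulli rewards, and the function names, the default `delta`, and the simplified threshold `log(3 * n**1.5 / delta)` are all hypothetical choices made here. Group restart and the tracking of the causal graph are not covered.

```python
import math


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def glr_change_detected(samples, delta=0.05):
    """Return True if a GLR test flags a change in the mean of `samples`.

    Assumptions (not from the abstract): rewards are Bernoulli, and the
    threshold is the simplified form log(3 * n**1.5 / delta).
    """
    n = len(samples)
    if n < 2:
        return False  # at least one sample is needed on each side of a split
    threshold = math.log(3 * n ** 1.5 / delta)
    total = sum(samples)
    mean_all = total / n
    prefix = 0.0
    # Scan every split point s, comparing the means before and after s
    # against the pooled mean of all n samples.
    for s in range(1, n):
        prefix += samples[s - 1]
        mean_left = prefix / s
        mean_right = (total - prefix) / (n - s)
        stat = (s * bernoulli_kl(mean_left, mean_all)
                + (n - s) * bernoulli_kl(mean_right, mean_all))
        if stat >= threshold:
            return True
    return False
```

In a restart-based policy, a positive detection would typically trigger discarding the affected statistics so that the UCB indices are rebuilt from post-change observations; which arms are reset together is exactly what a restarting strategy such as group restart has to decide.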