Sequential decision-making under uncertainty often involves long feedback delays. Such delays degrade the learning agent's ability to identify, in the long run, a subset of arms with the optimal collective reward. The problem becomes considerably more challenging in a non-stationary environment with structural dependencies among the reward distributions of the arms. Besides adapting to delays and environmental changes, learning the causal relations therefore alleviates the adverse effects of feedback delay on the decision-making process. We formalize this setting as a non-stationary and delayed combinatorial semi-bandit problem with causally related rewards. We model the causal relations by a directed graph in a stationary structural equation model. The agent maximizes the long-term average payoff, defined as a linear function of the base arms' rewards. We develop a policy that learns the structural dependencies from delayed feedback and uses them to optimize the decision-making while adapting to drifts. We prove a regret bound for the proposed algorithm. In addition, we evaluate our method numerically on synthetic and real-world datasets, using it to detect the regions that contribute the most to the spread of Covid-19 in Italy.
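To make the setting concrete, the following is a minimal illustrative sketch of its three ingredients: rewards coupled through a linear structural equation model on a directed acyclic graph, a payoff that is linear in the selected base arms' rewards, and semi-bandit feedback that arrives late. It is not the algorithm proposed here. The names `sem_rewards`, `linear_payoff`, and `DelayedFeedback` are hypothetical, the arms are assumed to be indexed in topological order, and the delay is assumed constant purely for simplicity.

```python
from collections import deque


def sem_rewards(weights, noise):
    """Solve r[j] = noise[j] + sum_i weights[i][j] * r[i] on a DAG.

    Assumption: arms are indexed in topological order, so only entries
    weights[i][j] with i < j (an edge i -> j) are used.
    """
    rewards = [0.0] * len(noise)
    for j in range(len(noise)):
        parents = sum(weights[i][j] * rewards[i] for i in range(j))
        rewards[j] = noise[j] + parents
    return rewards


def linear_payoff(rewards, super_arm):
    """Payoff of a super arm: the sum of its base arms' rewards."""
    return sum(rewards[i] for i in super_arm)


class DelayedFeedback:
    """Holds each observation for a fixed number of rounds (assumed constant)."""

    def __init__(self, delay):
        self.delay = delay
        self.pending = deque()  # entries are (release_round, observation)

    def push(self, t, observation):
        """Record an observation generated in round t."""
        self.pending.append((t + self.delay, observation))

    def pop_ready(self, t):
        """Return every observation whose release round is at most t."""
        ready = []
        while self.pending and self.pending[0][0] <= t:
            ready.append(self.pending.popleft()[1])
        return ready
```

With a three-arm chain 0 → 1 → 2, edge weights of 0.5, and unit exogenous terms, `sem_rewards` propagates each arm's reward to its descendants, so choosing an upstream arm influences the rewards observed downstream. A learner in this toy setup would only update its estimates from what `pop_ready` returns in each round.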