Partial Structure Discovery is Sufficient for No-regret Learning in Causal Bandits

Causal knowledge about the relationships among decision variables and a reward variable in a bandit setting can accelerate the learning of an optimal decision. Existing work often assumes that the causal graph is known, but this knowledge may not be available a priori. Motivated by this challenge, we focus on the causal bandit problem in scenarios where the underlying causal graph is unknown and may include latent confounders. While intervening on the parents of the reward node is optimal in the absence of latent confounders, this is not necessarily the case in general. Instead, one must consider a set of possibly optimal arms/interventions, each being a special subset of the ancestors of the reward node, which makes causal discovery beyond the parents of the reward node essential. For regret minimization, we find that discovering the full causal structure is unnecessary; however, no existing work characterizes which components of the causal graph are necessary and sufficient. We formally characterize the necessary and sufficient set of latent confounders that one needs to detect or learn to ensure that all possibly optimal arms are identified correctly. We also propose a randomized algorithm for learning the causal graph with a limited number of samples, providing a sample complexity guarantee for any desired confidence level. In the causal bandit setup, we propose a two-stage approach. In the first stage, we learn the induced subgraph on the ancestors of the reward node, along with a necessary and sufficient subset of latent confounders, to construct the set of possibly optimal arms. The regret incurred during this stage scales polynomially with the number of nodes in the causal graph. The second stage applies a standard bandit algorithm, such as UCB. We also establish a regret bound for the two-stage approach that is sublinear in the number of rounds.
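To make the second part of the two-stage approach concrete, the following is a minimal sketch of a standard UCB1 loop run over a fixed set of candidate arms. It assumes the first stage has already produced the set of possibly optimal arms; the arm labels (`X1`, `X2`), the Bernoulli reward means, and the function names `ucb1` and `demo` are hypothetical and chosen only for illustration. This is not the paper's algorithm, only an instance of the kind of standard bandit routine the abstract mentions.

```python
import math
import random


def ucb1(arms, pull, horizon):
    """Run UCB1 over a fixed list of candidate arms.

    arms: list of hashable arm labels (here, hypothetical intervention sets).
    pull: callable mapping an arm to a reward in [0, 1].
    horizon: total number of rounds.
    Returns (counts, means), both dictionaries keyed by arm.
    """
    counts = {a: 0 for a in arms}
    means = {a: 0.0 for a in arms}
    for t in range(1, horizon + 1):
        untried = [a for a in arms if counts[a] == 0]
        if untried:
            # Play every arm once before using confidence bounds.
            arm = untried[0]
        else:
            # Pick the arm with the largest upper confidence bound.
            arm = max(
                arms,
                key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        # Incremental update of the empirical mean reward.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


def demo(seed=0, horizon=5000):
    """Toy run with made-up Bernoulli means for three hypothetical arms."""
    rng = random.Random(seed)
    true_means = {
        frozenset(): 0.3,              # no intervention (assumed value)
        frozenset({"X1"}): 0.5,        # intervene on X1 (assumed value)
        frozenset({"X1", "X2"}): 0.8,  # intervene on X1 and X2 (assumed value)
    }
    arms = list(true_means)

    def pull(arm):
        return 1.0 if rng.random() < true_means[arm] else 0.0

    return ucb1(arms, pull, horizon)


if __name__ == "__main__":
    counts, means = demo()
    for arm, n in counts.items():
        print(sorted(arm), n, round(means[arm], 3))
```

In this toy setting, most pulls should concentrate on the arm with the highest assumed mean as the horizon grows, which is the behavior that makes the regret of this stage sublinear in the number of rounds.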
