Combinatorial pure exploration of causal bandits is the following online learning task: given a causal graph whose causal inference distributions are unknown, in each round we either intervene on a chosen subset of variables or perform no intervention, and then observe the random outcomes of all random variables. The goal is to use as few rounds as possible to output, with probability at least $1-\delta$, an intervention that gives the best (or almost best) expected outcome on the reward variable $Y$, where $\delta$ is a given confidence level. We provide the first gap-dependent and fully adaptive pure exploration algorithms for two types of causal models: the binary generalized linear model (BGLM) and general graphs. For BGLM, our algorithm is the first designed specifically for this setting and achieves polynomial sample complexity, whereas all existing algorithms for general graphs either have sample complexity exponential in the graph size or rely on some unreasonable assumptions. For general graphs, our algorithm significantly improves the sample complexity and nearly matches the lower bound we prove. Our algorithms achieve these improvements through a novel integration of prior causal bandit algorithms and prior adaptive pure exploration algorithms: the former exploit the rich observational feedback in causal bandits but are not adaptive to reward gaps, while the latter are adaptive to reward gaps but do not exploit this observational feedback.