Discovering causal relationships requires controlled experiments, but experimentalists face a sequential decision problem: each intervention reveals information that should inform what to try next. Traditional approaches such as random sampling, greedy information maximization, and round-robin coverage treat each decision in isolation, unable to learn adaptive strategies from experience. We propose Active Causal Experimentalist (ACE), which learns experimental design as a sequential policy. Our key insight is that while absolute information gains diminish as knowledge accumulates (making value-based RL unstable), relative comparisons between candidate interventions remain meaningful throughout. ACE exploits this via Direct Preference Optimization, learning from pairwise intervention comparisons rather than non-stationary reward magnitudes. Across synthetic benchmarks, physics simulations, and economic data, ACE achieves 70-71% improvement over baselines at equal intervention budgets (p < 0.001, Cohen's d ≈ 2). Notably, the learned policy autonomously discovers that collider mechanisms require concentrated interventions on parent variables, a theoretically grounded strategy that emerges purely from experience. This suggests preference-based learning can recover principled experimental strategies, complementing theory with learned domain adaptation.
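The abstract's central claim is that pairwise preferences between candidate interventions stay informative even as absolute information gains shrink. A minimal sketch of how such a preference signal can be turned into a training objective is the Bradley-Terry/DPO-style loss below; the function name, the `beta` temperature, and the scalar log-probability interface are illustrative assumptions for exposition, not ACE's actual implementation.

```python
import math

def dpo_pairwise_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style loss for one preference pair of candidate interventions.

    logp_w / logp_l: policy log-probabilities of the preferred ("winner")
    and dispreferred ("loser") intervention; ref_* are the same quantities
    under a frozen reference policy. Only the *relative* ordering matters,
    which is why the objective is robust to the non-stationary magnitude
    of information-gain rewards.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin: small when the policy prefers
    # the winning intervention more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A pair where the policy already favors the preferred intervention
# incurs lower loss than one where it favors the dispreferred one.
good = dpo_pairwise_loss(-1.0, -2.0, -1.5, -1.5)
bad = dpo_pairwise_loss(-2.0, -1.0, -1.5, -1.5)
```

At a zero margin the loss equals log 2, the uninformative-coin-flip baseline; training pushes each preferred intervention's relative log-probability up until its pairwise margin is positive.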