Machine Learning and AI have the potential to transform data-driven scientific discovery, enabling accurate predictions for several scientific phenomena. As many scientific questions are inherently causal, this paper looks at the causal inference task of treatment effect estimation, where we assume binary effects that are recorded as high-dimensional images in a Randomized Controlled Trial (RCT). Despite this being the simplest possible setting and a perfect fit for deep learning, we theoretically find that many common choices in the literature may lead to biased estimates. To test the practical impact of these considerations, we recorded the first real-world benchmark for causal inference downstream tasks on high-dimensional observations: an RCT studying how garden ants (Lasius neglectus) respond with hygienic grooming to microparticles applied onto their colony members. Comparing 6,480 models fine-tuned from state-of-the-art visual backbones, we find that the sampling and modeling choices significantly affect the accuracy of the causal estimate, and that classification accuracy is not a proxy for it. We further validated the analysis by repeating it on a synthetically generated visual dataset in which we control the causal model. Our results suggest that future benchmarks should carefully consider real downstream scientific questions, especially causal ones. Further, we highlight guidelines for representation learning methods to help answer causal questions in the sciences. All code and data will be released.
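To make concrete the claim that classification accuracy is not a proxy for the accuracy of the causal estimate, the following is a minimal sketch, not the paper's method or data. It assumes an RCT with a binary treatment and a binary outcome, estimates the average treatment effect (ATE) as a difference in means, and compares two hypothetical classifiers whose predictions replace the true outcomes. All counts, error patterns, and function names (`ate_difference_in_means`, `accuracy`, `flip`) are illustrative assumptions.

```python
def ate_difference_in_means(treatment, outcome):
    """ATE estimate in an RCT: mean outcome of treated minus mean outcome of control."""
    treated = [y for t, y in zip(treatment, outcome) if t == 1]
    control = [y for t, y in zip(treatment, outcome) if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def accuracy(y_true, y_pred):
    """Fraction of units whose predicted label matches the true label."""
    return sum(int(a == b) for a, b in zip(y_true, y_pred)) / len(y_true)


def flip(labels, n_false_neg, n_false_pos):
    """Copy labels, turning the first n_false_neg ones into zeros
    and the first n_false_pos zeros into ones (simulated classifier errors)."""
    preds = []
    fn_left, fp_left = n_false_neg, n_false_pos
    for y in labels:
        if y == 1 and fn_left > 0:
            preds.append(0)
            fn_left -= 1
        elif y == 0 and fp_left > 0:
            preds.append(1)
            fp_left -= 1
        else:
            preds.append(y)
    return preds


# Hypothetical RCT: 100 treated and 100 control units (made-up counts).
y_treated = [1] * 60 + [0] * 40   # 60% show the outcome under treatment
y_control = [1] * 30 + [0] * 70   # 30% show the outcome under control
treatment = [1] * 100 + [0] * 100
y_true = y_treated + y_control

# Classifier A: few errors, but all of them are missed positives in the treated arm.
pred_a = flip(y_treated, 8, 0) + flip(y_control, 0, 0)
# Classifier B: many more errors, but false positives and false negatives
# cancel within each arm.
pred_b = flip(y_treated, 10, 10) + flip(y_control, 10, 10)

true_ate = ate_difference_in_means(treatment, y_true)
acc_a, ate_a = accuracy(y_true, pred_a), ate_difference_in_means(treatment, pred_a)
acc_b, ate_b = accuracy(y_true, pred_b), ate_difference_in_means(treatment, pred_b)

print(f"true ATE: {true_ate:.2f}")
print(f"A: accuracy {acc_a:.2f}, ATE {ate_a:.2f}")
print(f"B: accuracy {acc_b:.2f}, ATE {ate_b:.2f}")
```

In this toy setup, classifier A is the more accurate of the two, yet its errors depend on the treatment arm and shift the estimated ATE away from the true value, while the less accurate classifier B recovers it. The example only illustrates that the two metrics can disagree; it says nothing about which error patterns occur in practice.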