Estimating conditional average dose responses (CADR) is an important but challenging problem. Estimators must correctly model the potentially complex relationships between covariates, interventions, doses, and outcomes. In recent years, the machine learning community has shown great interest in developing tailored CADR estimators that target specific challenges. Their performance is typically evaluated against other methods on (semi-)synthetic benchmark datasets. Our paper analyses this practice and shows that using popular benchmark datasets without further analysis is insufficient to judge model performance. Established benchmarks entail multiple challenges, whose impacts must be disentangled. Therefore, we propose a novel decomposition scheme that makes it possible to evaluate the impact of five distinct components contributing to CADR estimator performance. We apply this scheme to eight popular CADR estimators on four widely used benchmark datasets, running nearly 1,500 individual experiments. Our results reveal that most established benchmarks are challenging for reasons other than those their creators claim. Notably, confounding, the key challenge tackled by most estimators, is not an issue in any of the considered datasets. We discuss the major implications of our findings and present directions for future research.