The Causal Roadmap and Simulations to Improve the Rigor and Reproducibility of Real-Data Applications

The Causal Roadmap outlines a systematic approach to asking and answering questions of cause and effect: define the quantity of interest, evaluate needed assumptions, conduct statistical estimation, and carefully interpret results. To protect research integrity, it is essential that the algorithm for statistical estimation and inference be pre-specified prior to conducting any effectiveness analyses. However, it is often unclear which algorithm will perform optimally for the real-data application. As a result, there is a temptation to simply implement one's favorite algorithm, recycling prior code or relying on the default settings of a computing package. Here, we call for the use of simulations that realistically reflect the application, including key characteristics such as strong confounding and dependent or missing outcomes, to objectively compare candidate estimators and facilitate full specification of the Statistical Analysis Plan. Such simulations are informed by the Causal Roadmap and conducted after data collection but prior to effect estimation. We illustrate with two worked examples. First, in an observational longitudinal study, outcome-blind simulations are used to inform nuisance parameter estimation and variance estimation for longitudinal targeted minimum loss-based estimation (TMLE). Second, in a cluster randomized trial with missing outcomes, treatment-blind simulations are used to examine Type-I error control in Two-Stage TMLE. In both examples, realistic simulations empower us to pre-specify an estimation approach that is expected to have strong finite sample performance and also yield quality-controlled computing code for the actual analysis. Together, this process helps to improve the rigor and reproducibility of our research.
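As a rough illustration of comparing candidate estimators by simulation before any effect estimation, the following minimal sketch may help. It is not the simulation design or the TMLE implementation used in the worked examples. The data-generating process, the two simple estimators, and every name in it (`simulate_data`, `unadjusted`, `regression_adjusted`, `compare_estimators`, and all parameter values) are hypothetical assumptions chosen only to show the workflow under strong confounding with a known true effect.

```python
import numpy as np


def simulate_data(n, rng, true_effect=1.0, confounding=2.0):
    """Draw one hypothetical dataset with a single strong confounder W."""
    w = rng.normal(size=n)
    # Treatment probability depends strongly on W (assumed mechanism).
    p_treat = 1.0 / (1.0 + np.exp(-confounding * w))
    a = rng.binomial(1, p_treat)
    # Outcome depends on both treatment and the confounder.
    y = true_effect * a + 1.5 * w + rng.normal(size=n)
    return w, a, y


def unadjusted(w, a, y):
    """Candidate 1: difference in mean outcomes, ignoring W."""
    return y[a == 1].mean() - y[a == 0].mean()


def regression_adjusted(w, a, y):
    """Candidate 2: coefficient on A from least squares of Y on (1, A, W)."""
    design = np.column_stack([np.ones_like(y), a, w])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1]


def compare_estimators(estimators, n=500, n_reps=500, true_effect=1.0, seed=0):
    """Apply each candidate to the same simulated datasets and summarize."""
    rng = np.random.default_rng(seed)
    estimates = {name: [] for name in estimators}
    for _ in range(n_reps):
        w, a, y = simulate_data(n, rng, true_effect=true_effect)
        for name, estimator in estimators.items():
            estimates[name].append(estimator(w, a, y))
    summary = {}
    for name, values in estimates.items():
        values = np.asarray(values)
        summary[name] = {
            "bias": values.mean() - true_effect,
            "variance": values.var(ddof=1),
            "mse": np.mean((values - true_effect) ** 2),
        }
    return summary


results = compare_estimators(
    {"unadjusted": unadjusted, "regression_adjusted": regression_adjusted}
)
for name, metrics in results.items():
    print(name, {key: round(float(value), 3) for key, value in metrics.items()})
```

In a real application, the simulated data would instead mimic the observed covariates and study design, the candidates would be the estimators under consideration for the Statistical Analysis Plan, and the summary would likely also include confidence interval coverage and Type-I error.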
