Maxway CRT: Improving the Robustness of the Model-X Inference

The model-X conditional randomization test (CRT) is a flexible and powerful procedure for testing the conditional independence hypothesis that X is independent of Y given Z. Despite its many attractive properties, the model-X CRT relies on the model-X assumption that the distribution of X | Z is known exactly. If this distribution is modeled with error, the test may lose its validity. The problem is more severe when the adjustment covariates Z are high-dimensional, in which case precisely modeling X against Z can be hard. In response, we propose the Maxway (Model and Adjust X With the Assistance of Y) CRT, which learns the distribution of Y | Z and uses it to calibrate the resampling distribution of X, gaining robustness to errors in modeling X. We prove that the type-I error inflation of the Maxway CRT is controlled by the learning error for the low-dimensional adjusting model plus the product of the learning errors for X | Z and Y | Z, which can be interpreted as an "almost doubly robust" property. Building on this, we develop algorithms implementing the Maxway CRT in practical scenarios, including (surrogate-assisted) semi-supervised learning and transfer learning, where valid information about Y | Z may be provided by auxiliary or external data. Through extensive simulation studies under different scenarios, we demonstrate that the Maxway CRT achieves significantly better type-I error control than existing model-X inference approaches while preserving similar power. Finally, we apply our methodology to two real examples: (1) studying the obesity paradox with electronic health record (EHR) data assisted by surrogate variables, and (2) inferring the side effect of statins among the ethnic minority group by transferring knowledge from the majority group.
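
To make the resampling idea concrete, here is a minimal sketch of the standard model-X CRT that the abstract takes as its starting point. It is not the Maxway CRT, whose calibration step is not specified in this abstract. The function names (`crt_pvalue`, `sample_x_given_z`, `statistic`), the Gaussian linear model for X | Z, the coefficient vector `beta`, and the covariance-type test statistic are all illustrative assumptions rather than choices taken from the paper.

```python
import numpy as np

def crt_pvalue(x, y, z, sample_x_given_z, statistic, n_resamples=500, seed=None):
    """Monte Carlo p-value of the standard model-X CRT (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    t_obs = statistic(x, y, z)
    # Redraw X from the assumed X | Z distribution, holding Y and Z fixed.
    t_null = np.array([
        statistic(sample_x_given_z(z, rng), y, z) for _ in range(n_resamples)
    ])
    # The +1 terms count the observed statistic among the resampled ones.
    return (1 + np.sum(t_null >= t_obs)) / (n_resamples + 1)

# Toy data (assumption): X | Z ~ N(Z @ beta, 1) with beta known exactly,
# i.e. the model-X assumption holds by construction.
data_rng = np.random.default_rng(0)
n, p = 500, 5
beta = np.array([1.0, -0.5, 0.0, 0.0, 0.5])
z = data_rng.normal(size=(n, p))
x = z @ beta + data_rng.normal(size=n)
y_null = z[:, 0] + data_rng.normal(size=n)           # Y independent of X given Z
y_alt = z[:, 0] + 1.0 * x + data_rng.normal(size=n)  # Y depends on X given Z

def sample_x_given_z(z, rng):
    # Draw a fresh copy of X from the assumed conditional distribution.
    return z @ beta + rng.normal(size=z.shape[0])

def statistic(x, y, z):
    # Absolute sample covariance between the X residual and Y (hypothetical choice).
    return abs(np.mean((x - z @ beta) * y))

p_null = crt_pvalue(x, y_null, z, sample_x_given_z, statistic, seed=1)
p_alt = crt_pvalue(x, y_alt, z, sample_x_given_z, statistic, seed=1)
print(f"p-value under the null: {p_null:.3f}")
print(f"p-value under the alternative: {p_alt:.3f}")
```

In this sketch the sampler uses the true `beta`, so the p-value is valid by construction. Replacing `beta` in `sample_x_given_z` with a poorly estimated coefficient vector is the kind of X | Z modeling error that the abstract says can invalidate the test.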
