Valid statistical inference is crucial for decision-making but difficult to obtain in supervised learning with multimodal data, e.g., combinations of clinical features, genomic data, and medical images. Multimodal data often warrant the use of black-box algorithms, for instance, random forests or neural networks, which impede the use of traditional variable significance tests. We address this problem by proposing the use of COvariance Measure Tests (COMETs), which are calibrated and powerful tests that can be combined with any sufficiently predictive supervised learning algorithm. We apply COMETs to several high-dimensional, multimodal data sets to illustrate (i) variable significance testing for finding relevant mutations modulating drug activity, (ii) modality selection for predicting survival in liver cancer patients with multiomics data, and (iii) modality selection with clinical features and medical imaging data. In all applications, COMETs yield results consistent with domain knowledge without requiring data-driven pre-processing, which may invalidate type I error control. These novel applications with high-dimensional multimodal data corroborate prior results on the power and robustness of COMETs for significance testing. The comets R package and source code for reproducing all results are available at https://github.com/LucasKook/comets. All data sets used in this work are openly available.
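To make the idea of a covariance measure test concrete, the following is a minimal sketch of the Generalised Covariance Measure, one test of this kind, for the null hypothesis that Y is conditionally independent of X given Z. It is not the API of the comets R package: the function names (`fit_simple_linear`, `gcm_test`, `simulate`) and the simulated data are assumptions made purely for illustration, and a univariate least-squares fit stands in for the "sufficiently predictive" learner, which in practice could be a random forest or a neural network.

```python
import math
import random


def fit_simple_linear(z, y):
    """Least-squares fit of y on a single covariate z; returns fitted values.

    Stand-in (assumption) for any supervised learner that estimates E[y | z].
    """
    n = len(z)
    z_bar = sum(z) / n
    y_bar = sum(y) / n
    szz = sum((zi - z_bar) ** 2 for zi in z)
    szy = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y))
    slope = szy / szz
    intercept = y_bar - slope * z_bar
    return [intercept + slope * zi for zi in z]


def gcm_test(y, x, z, fit=fit_simple_linear):
    """Generalised Covariance Measure test of 'Y independent of X given Z'.

    Returns the test statistic and a two-sided p-value based on the
    standard normal approximation.
    """
    n = len(y)
    # Residuals from regressing Y on Z and X on Z.
    res_y = [yi - fi for yi, fi in zip(y, fit(z, y))]
    res_x = [xi - fi for xi, fi in zip(x, fit(z, x))]
    # Products of residuals; their mean estimates the conditional covariance.
    prod = [a * b for a, b in zip(res_y, res_x)]
    mean = sum(prod) / n
    var = sum(p * p for p in prod) / n - mean ** 2
    stat = math.sqrt(n) * mean / math.sqrt(var)
    # Two-sided p-value: 2 * (1 - Phi(|stat|)) = erfc(|stat| / sqrt(2)).
    p_value = math.erfc(abs(stat) / math.sqrt(2))
    return stat, p_value


def simulate(n, effect, seed):
    """Toy data (assumption): X depends on Z; Y depends on Z and,
    if effect != 0, also directly on X."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + rng.gauss(0, 1) for zi in z]
    y = [2 * zi + effect * xi + rng.gauss(0, 1) for zi, xi in zip(z, x)]
    return y, x, z


# Null case: X carries no information about Y beyond Z.
print(gcm_test(*simulate(2000, effect=0.0, seed=1)))
# Alternative: X has a direct effect on Y, so the test should reject.
print(gcm_test(*simulate(2000, effect=0.5, seed=2)))
```

The test only needs residuals from two regressions, which is why the regression method can be swapped freely: any learner passed as `fit` that predicts well enough yields the same normal reference distribution under the null.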