Valid statistical inference is crucial for decision-making but difficult to obtain in supervised learning with multimodal data, e.g., combinations of clinical features, genomic data, and medical images. Multimodal data often warrants the use of black-box algorithms, for instance, random forests or neural networks, which impede the use of traditional variable significance tests. We address this problem by proposing the use of COvariance Measure Tests (COMETs), which are calibrated and powerful tests that can be combined with any sufficiently predictive supervised learning algorithm. We apply COMETs to several high-dimensional, multimodal data sets to illustrate (i) variable significance testing for finding relevant mutations modulating drug activity, (ii) modality selection for predicting survival in liver cancer patients with multiomics data, and (iii) modality selection with clinical features and medical imaging data. In all applications, COMETs yield results consistent with domain knowledge without requiring data-driven pre-processing, which may invalidate type I error control. These novel applications with high-dimensional multimodal data corroborate prior results on the power and robustness of COMETs for significance testing. COMETs are implemented in the comets R package, available on CRAN, and in the pycomets Python library, available on GitHub. Source code for reproducing all results is available at https://github.com/LucasKook/comets. All data sets used in this work are openly available.
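To make the idea of a covariance measure test concrete, the following is a minimal sketch of one well-known member of this family, the generalised covariance measure (GCM) test of the null hypothesis that Y is conditionally independent of X given Z. It is an illustration under stated assumptions, not the API of the comets or pycomets packages: the names `gcm_test`, `ols_fit_predict`, and `simulate` are hypothetical, Z is taken to be a single covariate, and a simple linear regression stands in for the "sufficiently predictive" learner. Any regression routine with the same call signature could be passed in instead; flexible learners would typically need out-of-sample or out-of-bag predictions rather than the in-sample fits used here.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

# A learner takes (covariate, target) and returns fitted values for the target.
Regressor = Callable[[Sequence[float], Sequence[float]], List[float]]


def ols_fit_predict(z: Sequence[float], target: Sequence[float]) -> List[float]:
    """Placeholder learner (assumption): simple linear regression of target on z."""
    n = len(z)
    z_bar = sum(z) / n
    t_bar = sum(target) / n
    s_zz = sum((zi - z_bar) ** 2 for zi in z)
    s_zt = sum((zi - z_bar) * (ti - t_bar) for zi, ti in zip(z, target))
    slope = s_zt / s_zz if s_zz > 0 else 0.0
    return [t_bar + slope * (zi - z_bar) for zi in z]


def gcm_test(
    y: Sequence[float],
    x: Sequence[float],
    z: Sequence[float],
    fit_predict: Regressor = ols_fit_predict,
) -> Tuple[float, float]:
    """Return (statistic, two-sided p-value) for H0: Y independent of X given Z."""
    # Residuals of Y given Z and of X given Z.
    res_y = [yi - fi for yi, fi in zip(y, fit_predict(z, y))]
    res_x = [xi - fi for xi, fi in zip(x, fit_predict(z, x))]
    # Under H0 the residual products have mean zero; standardise their average.
    prod = [ry * rx for ry, rx in zip(res_y, res_x)]
    n = len(prod)
    mean = sum(prod) / n
    var = sum(p * p for p in prod) / n - mean ** 2
    if var <= 0:
        raise ValueError("residual products have zero variance")
    stat = math.sqrt(n) * mean / math.sqrt(var)
    # Two-sided p-value from the standard normal reference distribution.
    p_value = math.erfc(abs(stat) / math.sqrt(2))
    return stat, p_value


def simulate(n: int, effect: float, seed: int):
    """Toy data (assumption): X depends on Z; Y depends on Z and, if effect != 0, on X."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + rng.gauss(0, 1) for zi in z]
    y = [zi + effect * xi + rng.gauss(0, 1) for zi, xi in zip(z, x)]
    return y, x, z


if __name__ == "__main__":
    for effect in (0.0, 1.0):
        y, x, z = simulate(500, effect=effect, seed=1)
        stat, p_value = gcm_test(y, x, z)
        print(f"effect={effect}: statistic={stat:.2f}, p-value={p_value:.4g}")
```

The statistic only uses residuals from two regressions, which is why such a test can be paired with any learner that predicts Y and X from Z well enough; the choice of learner affects power and calibration but not the form of the test.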