Model selection is a central problem in modern statistics and machine learning. However, most existing methods do not apply to heavy-tailed data, which are common in real applications such as single-cell multiomics. In this paper, we propose a rank-sum based approach that outputs a confidence set containing the optimal model with guaranteed probability. Motivated by conformal inference, we develop a general method that requires no moment or tail assumptions on the data. We demonstrate the advantage of the proposed method through extensive simulations and a real-data application to a COVID-19 genomics dataset (Stephenson et al., 2021). To perform inference on rank-sum statistics, we derive a general Gaussian approximation theory for high-dimensional two-sample U-statistics, which may be of independent interest to the statistics and machine learning community.