The classification of random objects in metric spaces that lack a vector structure has attracted increasing attention. However, the complexity inherent in such non-Euclidean data often restricts existing models to handling only a limited number of features, which leaves a gap in real-world applications. To address this, we propose a data-adaptive filtering procedure that identifies informative features from large-scale random objects by leveraging a novel Kolmogorov-Smirnov-type statistic defined on the metric space. Our method applies to data in general metric spaces with binary labels and is therefore highly flexible. It is model-free, in that its implementation does not rely on any specified classifier. Theoretically, it controls the false discovery rate while guaranteeing the sure screening property. Empirically, when equipped with a Wasserstein metric, it demonstrates superior finite-sample performance compared to Euclidean competitors. When applied to an autism dataset, our method identifies significant brain regions associated with the condition. Moreover, by filtering hundreds of thousands of covariance matrices representing various brain connectivities, it reveals distinct interaction patterns among these regions between individuals with and without autism.
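To make the idea of screening metric-space features with a Kolmogorov-Smirnov-type statistic concrete, the following is a minimal sketch under stated assumptions, not the statistic or the procedure proposed in this work. It assumes that each feature is a one-dimensional empirical distribution compared through the 1-Wasserstein distance, and it compares the two classes through the empirical probabilities of closed metric balls centered at the observed objects. The names `wasserstein_1d`, `pairwise_distances`, `ball_ks_statistic`, and `screen_features`, as well as the fixed `top_k` cutoff, are hypothetical; in particular, the plain ranking below stands in for the data-adaptive threshold and carries no false discovery rate guarantee.

```python
import numpy as np


def wasserstein_1d(a, b):
    """1-Wasserstein distance between two equal-size 1-D samples.

    For equal sample sizes this is the mean absolute gap between sorted values.
    """
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(a - b)))


def pairwise_distances(objects, metric):
    """Symmetric n-by-n matrix of distances between the objects of one feature."""
    n = len(objects)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = metric(objects[i], objects[j])
    return dist


def ball_ks_statistic(dist, labels):
    """Illustrative KS-type statistic built from metric balls (an assumption).

    For every center x_i and radius d(x_i, x_j), compare the fraction of
    class-1 and class-0 objects inside the closed ball, and return the
    largest absolute difference. The value always lies in [0, 1].
    """
    labels = np.asarray(labels).astype(bool)
    # inside[i, j, k] is True when x_k lies in the ball centered at x_i
    # with radius d(x_i, x_j). Memory grows as n**3, fine for small n.
    inside = dist[:, None, :] <= dist[:, :, None]
    p1 = inside[:, :, labels].mean(axis=2)
    p0 = inside[:, :, ~labels].mean(axis=2)
    return float(np.abs(p1 - p0).max())


def screen_features(features, labels, metric, top_k):
    """Rank features by the statistic and keep the top_k largest.

    `features` is a list with one entry per feature; each entry is a list
    of n objects aligned with `labels`.
    """
    stats = np.array(
        [ball_ks_statistic(pairwise_distances(f, metric), labels) for f in features]
    )
    keep = np.argsort(-stats, kind="stable")[:top_k]
    return stats, keep
```

The statistic only ever touches the pairwise distance matrix, so swapping `wasserstein_1d` for another metric (for example, a distance between covariance matrices) requires no other change; this mirrors the model-free, metric-agnostic character described above, although the exact construction here is assumed for illustration.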