In our approach, we treat the data as realizations of a random field in a relevant Bochner space. Our key observation is that the classes can predominantly reside in two distinct subspaces. To uncover the separation between the classes, we use the Karhunen-Loève expansion to construct the appropriate subspaces. The novel features forming these bases are obtained by applying a coordinate transformation based on recent Functional Data Analysis theory for anomaly detection. The associated signal decomposition is an exact hierarchical tensor product expansion with known optimality properties for approximating stochastic processes (random fields) with finite-dimensional function spaces. Using a hierarchical finite-dimensional expansion of the nominal class, a series of orthogonal nested subspaces is constructed for detecting anomalous signal components. Projection coefficients of the input data in these subspaces are then used to train a Machine Learning (ML) classifier. Because the signal is split into nominal and anomalous projection components, clearer separation surfaces between the classes arise. In fact, we show that with a sufficiently accurate estimate of the covariance structure of the nominal class, a sharp classification can be obtained. This is particularly advantageous for large unbalanced datasets. We demonstrate the approach on a number of high-dimensional datasets, where it yields significant increases in accuracy compared to using the same ML algorithm on the original feature data. Our tests on the Alzheimer's Disease ADNI dataset show a dramatic increase in accuracy (from 48% to 89%). Furthermore, tests using unbalanced semi-synthetic datasets created from the benchmark GCM dataset confirm increased accuracy as the dataset becomes more unbalanced.
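The feature construction described in the abstract can be illustrated with a deliberately simplified, single-level sketch. It is not the authors' hierarchical multilevel construction: it assumes a plain truncated Karhunen-Loève (PCA) basis estimated from nominal-class samples, and it summarizes the anomalous component by the norm of the residual orthogonal to that basis. The function names `fit_nominal_basis` and `kl_features` and the parameter `n_components` are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_nominal_basis(X_nominal, n_components):
    """Estimate a truncated KL (PCA) basis from nominal-class samples.

    Assumption: rows of X_nominal are samples, columns are features.
    Returns the empirical mean and an (n_components, n_features) array
    whose rows are orthonormal eigenvectors of the empirical covariance.
    """
    mean = X_nominal.mean(axis=0)
    centered = X_nominal - mean
    # Right singular vectors of the centered data are the eigenvectors
    # of the empirical covariance, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


def kl_features(X, mean, basis):
    """Split each sample into nominal coefficients and an anomalous residual.

    Returns an array of shape (n_samples, n_components + 1): the projection
    coefficients on the nominal basis, followed by the norm of the component
    orthogonal to the nominal subspace.
    """
    centered = X - mean
    coeffs = centered @ basis.T            # nominal projection coefficients
    residual = centered - coeffs @ basis   # part outside the nominal subspace
    residual_norm = np.linalg.norm(residual, axis=1, keepdims=True)
    return np.hstack([coeffs, residual_norm])
```

Under these assumptions, the output of `kl_features` would replace the original feature vectors as input to whichever ML classifier is being used. Samples that are well explained by the nominal covariance structure have a small last feature, while samples with a component outside the nominal subspace have a large one.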