``Noisy'' datasets (regimes with low signal-to-noise ratios, small sample sizes, faulty data collection, etc.) remain a key research frontier for classification methods, with both theoretical and practical implications. We introduce FINDER, a rigorous framework for analyzing generic classification problems, with tailored algorithms for noisy datasets. FINDER incorporates fundamental ideas from stochastic analysis into the feature learning and inference stages to optimally account for the randomness inherent in all empirical datasets. We construct ``stochastic features'' by first viewing empirical datasets as realizations of an underlying random field (without assumptions on its exact distribution) and then mapping them to appropriate Hilbert spaces. The Kosambi-Karhunen-Lo\`eve expansion (KLE) breaks these stochastic features into computable irreducible components, which allow classification over noisy datasets via an eigen-decomposition: data from different classes reside in distinct regions, identified by analyzing the spectrum of the associated operators. We validate FINDER on several challenging, data-deficient scientific domains, producing state-of-the-art breakthroughs in (i) Alzheimer's disease stage classification and (ii) remote sensing detection of deforestation. We end with a discussion of when FINDER is expected to outperform existing methods, its failure modes, and other limitations.
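The abstract does not spell out FINDER's algorithm, so the following is only a minimal sketch of the general idea it alludes to, not the authors' method. It assumes a discrete, empirical KLE (equivalent to an eigen-decomposition of the sample covariance) fitted on one class, and separates classes by the energy left outside the truncated expansion. The function names, the synthetic data, and the threshold rule are all hypothetical choices made for illustration.

```python
import numpy as np


def fit_kle(X, n_components):
    """Empirical KLE of the rows of X: mean plus leading covariance eigenpairs."""
    mean = X.mean(axis=0)
    centered = X - mean
    cov = centered.T @ centered / (X.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # keep the largest ones
    return mean, eigvals[order], eigvecs[:, order]


def residual_energy(X, mean, basis):
    """Squared norm of the part of each row not captured by the truncated KLE."""
    centered = X - mean
    coeffs = centered @ basis  # KLE coefficients of each sample
    residual = centered - coeffs @ basis.T
    return np.sum(residual**2, axis=1)


def demo(seed=0):
    """Synthetic illustration (assumed data model, not from the paper)."""
    rng = np.random.default_rng(seed)
    dim, k = 20, 3

    # Class A: concentrated near a k-dimensional subspace, plus small noise.
    subspace = np.linalg.qr(rng.standard_normal((dim, k)))[0]

    def sample_a(n):
        latent = 3.0 * rng.standard_normal((n, k))
        return latent @ subspace.T + 0.1 * rng.standard_normal((n, dim))

    # Class B: isotropic noise with no low-dimensional structure.
    def sample_b(n):
        return rng.standard_normal((n, dim))

    train_a = sample_a(100)
    mean, _, basis = fit_kle(train_a, n_components=k)

    # Hypothetical decision rule: flag a sample as class B when its residual
    # energy is far above what class A training data typically leaves behind.
    threshold = 10.0 * np.median(residual_energy(train_a, mean, basis))

    test_a, test_b = sample_a(200), sample_b(200)
    pred_a = residual_energy(test_a, mean, basis) > threshold  # True means "B"
    pred_b = residual_energy(test_b, mean, basis) > threshold
    accuracy = (np.sum(~pred_a) + np.sum(pred_b)) / 400
    return accuracy


if __name__ == "__main__":
    print(f"accuracy on synthetic data: {demo():.3f}")
```

The design choice here is to classify by what the truncated expansion fails to capture rather than by the leading coefficients themselves; whether FINDER does the same is not stated in the abstract.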