Small data learning problems are characterized by a significant discrepancy between the limited number of response variable observations and the large dimension of the feature space. In this setting, common learning tools struggle to distinguish the features that are important for the classification task from those that carry no relevant information, and cannot derive an appropriate learning rule that discriminates between the different classes. As a potential solution to this problem, we exploit the idea of reducing and rotating the feature space into a lower-dimensional gauge and propose the Gauge-Optimal Approximate Learning (GOAL) algorithm, which provides an analytically tractable joint solution to the dimension reduction, feature segmentation, and classification problems for small data learning. We prove that the optimal solution of the GOAL algorithm consists of piecewise-linear functions in Euclidean space and that it can be approximated by a monotonically convergent algorithm which, under the assumption of a discrete segmentation of the feature space, admits a closed-form solution for each optimization substep and has an overall linear scaling of the iteration cost. We compare the GOAL algorithm with other state-of-the-art machine learning (ML) tools on synthetic data and on challenging real-world applications from climate science and bioinformatics (i.e., prediction of the El Niño Southern Oscillation and inference of epigenetically induced gene-activity networks from limited experimental data). The experimental results show that the proposed algorithm outperforms the reported best competitors for these problems in both learning performance and computational cost.
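The abstract does not state the GOAL objective or its update rules, so the following is only a minimal sketch of the general pattern it describes: an alternating scheme in which a reduction-and-rotation matrix, a discrete segmentation, and the segment centres are each updated in closed form, so that the loss cannot increase from one iteration to the next. It is not the GOAL algorithm. The objective (a reduced k-means-style reconstruction error with no classification term), the function name `alternating_reduce_and_segment`, and all parameter names are assumptions made for illustration.

```python
import numpy as np


def alternating_reduce_and_segment(X, n_boxes=3, n_dims=2, n_iter=30, seed=0):
    """Illustrative stand-in, NOT the GOAL algorithm from the abstract.

    Minimizes ||X - U C A^T||_F^2 over
      A: (D, d) matrix with orthonormal columns (reduction and rotation),
      C: (K, d) segment centres in the reduced space,
      U: (N, K) one-hot segmentation, stored here as a label vector.
    Each substep is an exact closed-form minimization over one block.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    # Assumed initialization: random orthonormal A, random points as centres.
    A, _ = np.linalg.qr(rng.standard_normal((n_features, n_dims)))
    C = X[rng.choice(n_samples, size=n_boxes, replace=False)] @ A
    losses = []
    for _ in range(n_iter):
        # Substep 1 (segmentation): assign each point to the nearest
        # centre after mapping the centres back to the full space.
        recon = C @ A.T
        dist2 = ((X[:, None, :] - recon[None, :, :]) ** 2).sum(axis=2)
        labels = dist2.argmin(axis=1)
        # Substep 2 (centres): mean of the projected points per segment.
        # An empty segment keeps its old centre, which leaves the loss unchanged.
        Z = X @ A
        for k in range(n_boxes):
            members = labels == k
            if members.any():
                C[k] = Z[members].mean(axis=0)
        # Substep 3 (rotation): orthogonal Procrustes solution via thin SVD,
        # which maximizes trace(A^T X^T U C) subject to A^T A = I.
        M = X.T @ C[labels]
        P, _, Qt = np.linalg.svd(M, full_matrices=False)
        A = P @ Qt
        losses.append(float(((X - C[labels] @ A.T) ** 2).sum()))
    return A, C, labels, losses


if __name__ == "__main__":
    # Small-data regime: few observations relative to the feature dimension.
    rng = np.random.default_rng(1)
    X = rng.standard_normal((40, 10))
    A, C, labels, losses = alternating_reduce_and_segment(X)
    print("first and last loss:", losses[0], losses[-1])
```

Because each of the three substeps solves its block exactly while the other two are held fixed, the recorded loss sequence is non-increasing, which is the sense of "monotonically convergent" illustrated here. The distance computation in substep 1 costs O(N K D) per iteration in this sketch, i.e. linear in the number of observations N.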