High-dimension, low-sample-size (HDLSS) data are prevalent in many fields of study. Growing interest in using machine learning and statistical methods to mine valuable information from such data sets has driven a corresponding interest in efficient learning in high dimensions. As the dimension of the input data increases, the learning task becomes more difficult because both the computational and the statistical complexity grow. It is therefore crucial to overcome the curse of dimensionality in a given data set within a reasonable time frame, in order to obtain the insights required to keep a competitive edge. To solve HDLSS problems, classical methods such as support vector machines can be used, but they are known to suffer from data piling at the margin. When we question geometric domains and their assumptions on the input data, we are naturally led to convex optimisation problems, which gives rise to solutions such as distance weighted discrimination (DWD). DWD can be modelled as a second-order cone programming problem and solved by interior-point methods when the sample size and feature dimension of the data are moderate. In this paper, our focus is on designing an even more scalable and robust algorithm for solving large-scale generalized DWD problems.