Despite their simplicity, low computational cost, and modest data requirements, parametric machine learning algorithms such as linear discriminant analysis, quadratic discriminant analysis, and logistic regression suffer from serious drawbacks: linearity assumptions, a poor fit of the features to the normal distribution that is usually imposed, and sensitivity to high dimensionality. Batch kernel-based nonparametric classifiers, which overcome the linearity and normality constraints on the features, are an interesting alternative for supervised classification problems. However, they suffer from the "curse of dimensionality". This problem can be alleviated by the explosive growth of sample sizes in the era of big data, but large-scale data in turn raise challenges for storing the data and computing the classifier. These challenges make the classical batch nonparametric classifier no longer applicable. This motivates us to develop a fast algorithm suited to real-time computation of the nonparametric classifier in massive as well as streaming data frameworks. The online classifier involves two steps. First, we use an online principal component analysis to reduce the dimension of the features at a very low computational cost. Then, a stochastic approximation algorithm is deployed to compute the nonparametric classifier in real time. The proposed methods are evaluated and compared with some commonly used machine learning algorithms for real-time fetal well-being monitoring. The study reveals that, in terms of accuracy, both the offline (batch) and the online classifiers are good competitors to the random forest algorithm. Moreover, we show that the online classifier offers a better trade-off between accuracy and computational cost than the offline classifier.
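The two-step procedure described above can be illustrated with a minimal sketch. The abstract does not state the update rules, so everything below is an assumption rather than the authors' method: Oja's rule stands in for the online principal component analysis (a single component, with features assumed to be already centered), and a recursive kernel estimate with step size 1/n and bandwidth n^(-1/5) stands in for the stochastic approximation step. The names `oja_update`, `RecursiveKernelClassifier`, and `run_demo` are hypothetical, and the data are synthetic.

```python
import math
import random


def oja_update(w, x, lr):
    """One step of Oja's rule (assumed stand-in for online PCA).

    Moves the unit vector w toward the leading principal direction
    of the (assumed centered) stream, then renormalizes it.
    """
    y = sum(wi * xi for wi, xi in zip(w, x))  # projection of x onto w
    w = [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]


def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


class RecursiveKernelClassifier:
    """Per-class kernel scores at fixed query points, updated per observation.

    Each score is a running average of kernel weights, so no past
    observation has to be stored (hypothetical illustration).
    """

    def __init__(self, query_points, classes):
        self.query_points = list(query_points)
        self.scores = {c: [0.0] * len(self.query_points) for c in classes}
        self.n = 0

    def update(self, z, label):
        self.n += 1
        gamma = 1.0 / self.n   # assumed step size
        h = self.n ** (-0.2)   # assumed bandwidth n^(-1/5)
        for c, s in self.scores.items():
            for j, q in enumerate(self.query_points):
                # Only the observed class receives kernel mass at this step.
                target = gaussian_kernel((q - z) / h) / h if c == label else 0.0
                s[j] += gamma * (target - s[j])

    def predict(self, j):
        """Return the class with the largest score at query point j."""
        return max(self.scores, key=lambda c: self.scores[c][j])


def run_demo(seed=0, n_samples=2000):
    """Stream synthetic 2-D data from two classes through both steps."""
    rng = random.Random(seed)
    w = [1.0, 0.0]  # initial guess for the principal direction
    clf = RecursiveKernelClassifier(query_points=[-2.5, 2.5], classes=[0, 1])
    for _ in range(n_samples):
        label = rng.randint(0, 1)
        center = 2.0 if label == 1 else -2.0
        x = [center + rng.gauss(0.0, 0.5), center + rng.gauss(0.0, 0.5)]
        w = oja_update(w, x, lr=0.01)             # step 1: reduce dimension
        z = sum(wi * xi for wi, xi in zip(w, x))  # 1-D projected feature
        clf.update(z, label)                      # step 2: update classifier
    return w, clf


if __name__ == "__main__":
    w, clf = run_demo()
    print("estimated direction:", w)
    print("predicted classes at query points:", clf.predict(0), clf.predict(1))
```

In this sketch each observation is processed once and then discarded, which is the property that makes such a scheme usable on streaming data. Fixing the query points in advance is a simplification made here for brevity.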