We present a randomized Kaczmarz method for linear discriminant analysis (rkLDA), an iterative randomized approach to binary-class Gaussian model linear discriminant analysis (LDA) for very large data. We harness a least squares formulation and work within the stochastic gradient descent framework to obtain a randomized classifier whose accuracy can be comparable to that of full data LDA. We present an analysis of the expected change in the LDA discriminant function when the randomized Kaczmarz solution is used in lieu of the full data least squares solution, accounting for both the Gaussian modeling assumptions on the data and the algorithmic randomness. Our analysis shows how the expected change depends on quantities inherent in the data, such as the scaled condition number and Frobenius norm of the input data and how well the linear model fits the data, as well as on choices made in the randomized algorithm. Our experiments demonstrate that rkLDA can offer a viable alternative to full data LDA across a range of step-sizes and numbers of iterations.
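To make the idea concrete, the following is a minimal sketch (not the paper's exact rkLDA algorithm) of the standard randomized Kaczmarz iteration applied to a least squares formulation of binary LDA: class labels are encoded as ±1 targets, rows are sampled with probability proportional to their squared norms, and the resulting iterate is compared against the full data least squares classifier. All variable names, the synthetic two-class Gaussian data, and the parameter choices here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class Gaussian data (illustrative of the binary LDA setting).
n_per, d = 200, 5
X0 = rng.normal(loc=-1.0, size=(n_per, d))
X1 = rng.normal(loc=+1.0, size=(n_per, d))
A = np.vstack([np.hstack([X0, np.ones((n_per, 1))]),   # append intercept column
               np.hstack([X1, np.ones((n_per, 1))])])
b = np.concatenate([-np.ones(n_per), np.ones(n_per)])  # +/-1 class labels

def randomized_kaczmarz(A, b, n_iters=20000, step=1.0, seed=1):
    """Randomized Kaczmarz for the least squares system A x ≈ b.

    Rows are sampled with probability proportional to ||a_i||^2,
    and `step` is the relaxation (step-size) parameter.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    row_norms_sq = np.einsum('ij,ij->i', A, A)
    probs = row_norms_sq / row_norms_sq.sum()
    x = np.zeros(n)
    for i in rng.choice(m, size=n_iters, p=probs):
        # Project (with relaxation) onto the hyperplane a_i^T x = b_i.
        x += step * (b[i] - A[i] @ x) / row_norms_sq[i] * A[i]
    return x

x_rk = randomized_kaczmarz(A, b)                  # randomized solution
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)      # full data least squares

acc_rk = np.mean(np.sign(A @ x_rk) == b)
acc_ls = np.mean(np.sign(A @ x_ls) == b)
```

On well-separated data like this, the classification accuracy of the randomized Kaczmarz iterate closely tracks that of the full data least squares solution while only ever touching one row of `A` per iteration.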