The era of huge data necessitates highly efficient machine learning algorithms. Many common machine learning algorithms, however, rely on computationally intensive subroutines that are prohibitively expensive on large datasets. Existing techniques often subsample the data or use other methods to improve computational efficiency, at the expense of incurring some approximation error. This thesis demonstrates that it is often sufficient, instead, to replace computationally intensive subroutines with a special kind of randomized counterpart that results in almost no degradation in quality.