One of the major limits of kernel ridge regression (KRR) is that storing and manipulating the kernel matrix K_n for n samples requires O(n^2) space, which quickly becomes infeasible for large n. Nyström approximations reduce the space complexity to O(nm) by sampling m columns from K_n. Uniform sampling preserves KRR accuracy (up to epsilon) only when m is proportional to the maximal degrees of freedom of K_n, which may require O(n) columns on datasets with high coherence. Sampling columns according to their ridge leverage scores (RLS) yields accurate Nyström approximations with m proportional to the effective dimension, but computing exact RLS itself requires O(n^2) space. Calandriello et al. (2016) propose INK-Estimate, an algorithm that processes the dataset incrementally, updating RLS, the effective dimension, and the Nyström approximation on the fly. Its space complexity scales with the effective dimension but carries an additional dependency on the largest eigenvalue of K_n, which in the worst case is O(n). In this paper we introduce SQUEAK, a new algorithm that builds on INK-Estimate but works with unnormalized RLS. As a consequence, SQUEAK is simpler, does not need to estimate the effective dimension for normalization, and achieves a space complexity only a constant factor worse than that of exact RLS sampling.
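To make the quantities in the abstract concrete, the following sketch computes exact ridge leverage scores and the resulting Nyström approximation with numpy. Note that this is the O(n^2)-space baseline the abstract contrasts against, not the incremental SQUEAK algorithm itself; the kernel choice, regularization value, and oversampling factor are illustrative assumptions.

```python
import numpy as np

# Toy data and a Gaussian kernel matrix K_n (this is the O(n^2)-space
# object that exact RLS computation requires; SQUEAK avoids forming it).
rng = np.random.default_rng(0)
n, lam = 500, 0.1                      # sample size and ridge parameter (illustrative)
X = rng.normal(size=(n, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)

# Exact ridge leverage scores: l_i = [K (K + lam*n*I)^{-1}]_{ii}.
rls = np.diag(K @ np.linalg.inv(K + lam * n * np.eye(n)))
d_eff = rls.sum()                      # effective dimension = sum of RLS

# Sample m columns with probability proportional to RLS
# (oversampling factor 5 is an arbitrary illustrative choice).
m = max(1, int(np.ceil(5 * d_eff)))
S = rng.choice(n, size=m, replace=False, p=rls / d_eff)

# Nystrom approximation K_tilde = K[:,S] K[S,S]^+ K[S,:],
# which only needs the O(nm) submatrix K[:,S] once S is fixed.
K_tilde = K[:, S] @ np.linalg.pinv(K[np.ix_(S, S)]) @ K[S]

err = np.linalg.norm(K - K_tilde, 2)   # spectral-norm approximation error
```

Since m is proportional to d_eff rather than to n, the stored submatrix is far smaller than K whenever the kernel spectrum decays quickly; the point of SQUEAK is to reach a comparable m without ever paying the O(n^2) cost of the exact RLS above.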