Computational efficiency is a major bottleneck in using classic graph-based approaches for semi-supervised learning on datasets with a large number of unlabeled examples. Known techniques for improving efficiency typically approximate the graph regularization objective, but they suffer from two major drawbacks: first, the graph is assumed to be known or is constructed with heuristic hyperparameter values; second, they do not provide a principled approximation guarantee for learning over the full unlabeled dataset. Building on recent work on learning graphs for semi-supervised learning from multiple datasets for problems from the same domain, and leveraging fast approximation techniques for solving linear systems in the graph Laplacian matrix, we propose algorithms that overcome both of these limitations. We show a formal separation in the learning-theoretic complexity of sparse and dense graph families. We further show how to approximately learn the best graphs from the sparse families efficiently using the conjugate gradient method. Our approach can also be used to learn the graph efficiently online with sub-linear regret, under mild smoothness assumptions. Our online learning results are stated generally and may be useful for approximate and efficient parameter tuning in other problems. We implement our approach and demonstrate significant ($\sim$10-100x) speedups over prior work on semi-supervised learning with learned graphs on benchmark datasets.
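To make the role of the conjugate gradient method concrete, here is a minimal sketch. It is not the authors' implementation. It assumes the classic harmonic-function formulation of graph-based semi-supervised learning, in which the soft labels $f_U$ on unlabeled nodes solve the Laplacian system $L_{UU} f_U = W_{UL} y_L$. The function names, the adjacency-dictionary graph representation, and the toy path graph are all illustrative assumptions. Each matrix-vector product touches every edge once, which is one reason sparse graphs are cheap to work with.

```python
# Hypothetical sketch: harmonic label propagation solved with conjugate gradient.
# The graph is an adjacency dict {node: {neighbor: weight}} (an assumed representation).

def laplacian_matvec(adj, unlabeled, x):
    """Return L_UU @ x, where L = D - W, touching each stored edge once."""
    out = {}
    for u in unlabeled:
        degree = sum(adj[u].values())      # full degree, including labeled neighbors
        acc = degree * x[u]
        for v, w in adj[u].items():
            if v in x:                     # only unlabeled neighbors appear in x
                acc -= w * x[v]
        out[u] = acc
    return out

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A given only via matvec."""
    x = {k: 0.0 for k in b}
    r = dict(b)                            # residual b - A x for the start x = 0
    p = dict(r)                            # initial search direction
    rs_old = sum(v * v for v in r.values())
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:            # stop once the residual norm is small
            break
        Ap = matvec(p)
        alpha = rs_old / sum(p[k] * Ap[k] for k in p)
        for k in x:
            x[k] += alpha * p[k]
            r[k] -= alpha * Ap[k]
        rs_new = sum(v * v for v in r.values())
        beta = rs_new / rs_old
        for k in p:
            p[k] = r[k] + beta * p[k]
        rs_old = rs_new
    return x

def harmonic_labels(adj, labels):
    """Soft labels for unlabeled nodes, assuming each one reaches a labeled node."""
    unlabeled = [u for u in adj if u not in labels]
    # Right-hand side W_UL @ y_L: weighted sum of labeled neighbors' labels.
    b = {u: sum(w * labels[v] for v, w in adj[u].items() if v in labels)
         for u in unlabeled}
    return conjugate_gradient(lambda x: laplacian_matvec(adj, unlabeled, x), b)

# Toy example: unit-weight path 0 - 1 - 2 - 3 with the two endpoints labeled.
adj = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0, 3: 1.0}, 3: {2: 1.0}}
soft = harmonic_labels(adj, {0: 1.0, 3: 0.0})
print(soft)  # nodes 1 and 2 interpolate between the endpoint labels
```

On this path graph the exact solution is $f_1 = 2/3$ and $f_2 = 1/3$. In practice one would cap `max_iter` or loosen `tol` to trade accuracy for speed, which is the kind of approximation the abstract alludes to.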