Many machine learning problems, including similarity learning, ranking, and clustering, rely on empirical pairwise loss functions whose quadratic computational cost quickly becomes prohibitive at scale. We demonstrate how a frugal approach that retains only a fraction of the available information on pairs can achieve estimation or optimization performance comparable to that obtained by using all pairs, by leveraging survey sampling techniques. A central finding, supported by both theory and experiments, is that such sampling plans must target pairs directly rather than individual observations. In particular, for pairwise losses between high-dimensional vectors such as embeddings in vision or graph learning, assigning higher inclusion probabilities to informative pairs using suitable auxiliary information yields performance close to full pairwise evaluation, providing a principled and theoretically grounded trade-off between accuracy and computational cost.
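To make the idea of sampling pairs directly more concrete, the following is a minimal sketch, not the authors' implementation. It assumes Poisson sampling of pairs and a Horvitz-Thompson weighted estimate of the mean pairwise loss. The squared-distance loss, the auxiliary score (distance on the first few coordinates), the budget, and all function names are illustrative assumptions. For clarity the sketch enumerates all pair indices, so only the expensive loss evaluations are saved.

```python
import numpy as np

def squared_distance(a, b):
    # Illustrative pairwise loss, evaluated row by row: ||a - b||^2.
    return np.sum((a - b) ** 2, axis=1)

def inclusion_probs(aux, budget):
    # Inclusion probabilities proportional to a positive auxiliary score,
    # scaled so the expected number of sampled pairs is at most `budget`,
    # and capped at 1.
    return np.minimum(1.0, budget * aux / aux.sum())

def ht_pairwise_mean(X, loss, probs, rng):
    # Poisson sampling over pairs (i < j): each pair is kept independently
    # with its own inclusion probability.
    n = X.shape[0]
    i, j = np.triu_indices(n, k=1)
    keep = rng.random(i.size) < probs
    # The loss is evaluated on the sampled pairs only.
    vals = loss(X[i[keep]], X[j[keep]])
    # Horvitz-Thompson weighting: dividing by the inclusion probability
    # keeps the estimate of the full pairwise mean unbiased.
    estimate = np.sum(vals / probs[keep]) / i.size
    return estimate, int(keep.sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))          # synthetic stand-in for embeddings
i, j = np.triu_indices(X.shape[0], k=1)

# Hypothetical auxiliary information: a cheap proxy computed on the first
# 4 coordinates, offset so every pair has a nonzero inclusion probability.
aux = squared_distance(X[i, :4], X[j, :4]) + 1e-3
probs = inclusion_probs(aux, budget=2000)

estimate, n_used = ht_pairwise_mean(X, squared_distance, probs, rng)
full = squared_distance(X[i], X[j]).mean()  # reference value using all pairs
print(f"sampled pairs: {n_used} of {i.size}")
print(f"weighted estimate: {estimate:.3f}, full pairwise mean: {full:.3f}")
```

In this sketch the sampling design acts on pairs rather than on observations, and the auxiliary score only influences the variance of the estimate, not its unbiasedness, provided every pair has a strictly positive inclusion probability.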