Neighbour embeddings (NE) represent high-dimensional datasets in lower-dimensional spaces and are often used for data visualisation. In practice, accelerated approximations are employed to handle very large datasets. Accelerating NE is challenging, and two main directions have been explored. Very coarse approximations based on negative sampling (as in UMAP) achieve high effective speed but may lose quality in the extracted structures. Less coarse approximations, as used in FIt-SNE or BH-t-SNE, offer better structure preservation at the cost of speed, and they also restrict the target dimensionality to 2 or 3, limiting NE to visualisation. In some variants, the precision of these costlier accelerations also enables finer-grained control over the extracted structures through dedicated hyperparameters. This paper proposes to bridge the gap between the two approaches by introducing a novel way to accelerate NE that requires only a small number of computations per iteration while maintaining good fine-grained structure preservation and flexibility through hyperparameter tuning, without limiting the dimensionality of the embedding space. The method was designed for interactive exploration of data; as such, it abandons the traditional two-phase approach of other NE methods, allowing instantaneous visual feedback when hyperparameters are changed, even when these control processes on the high-dimensional side of the computations. Experiments using a publicly available, GPU-accelerated GUI integration of the method show promising results in terms of speed and flexibility in the extracted structures, and indicate potential uses in broader machine learning contexts with minimal algorithmic modifications. Central to the algorithm is a novel approach to iterative approximate nearest neighbour search, which shows promising results compared to nearest neighbour descent.