Dimensionality reduction methods such as UMAP and t-SNE are central tools for visualising high-dimensional data, but their local-neighbourhood objectives can preserve sampling noise while distorting global topology. We show that standard local metrics reward this noise memorisation: top-performing embeddings invent cycles and disconnected islands that are absent from the data. We introduce a topology-faithfulness benchmark based on noisy manifolds with known homology, tune DiRe against it, and find Pareto-optimal configurations that match or beat GPU-accelerated UMAP on classification while recovering the exact first Betti numbers on stress tests. On 723K arXiv paper embeddings, DiRe preserves 3 to 4 times more topological structure than UMAP at comparable wall-clock time.
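To make the idea of a "noisy manifold with known homology" concrete, the sketch below samples a noisy circle in a higher-dimensional ambient space and checks its Betti numbers at one fixed scale. It is an illustration only, not the benchmark or the DiRe code from this work: the function names `noisy_circle` and `rips_betti`, the noise level, the ambient dimension, and the scale `eps` are all assumptions chosen for the example. It uses a single Vietoris-Rips complex and real-valued boundary-matrix ranks rather than persistent homology, which is only practical for small point sets.

```python
from itertools import combinations

import numpy as np


def noisy_circle(n=40, ambient_dim=10, noise=0.02, seed=0):
    """Sample n points on a unit circle in R^ambient_dim, plus Gaussian noise.

    The underlying manifold is a circle, so the expected Betti numbers are
    b0 = 1 (one connected component) and b1 = 1 (one loop).
    """
    rng = np.random.default_rng(seed)
    theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    pts = np.zeros((n, ambient_dim))
    pts[:, 0] = np.cos(theta)
    pts[:, 1] = np.sin(theta)
    return pts + noise * rng.standard_normal(pts.shape)


def rips_betti(points, eps):
    """Return (b0, b1) of the Vietoris-Rips complex of `points` at scale eps.

    Vertices are points, edges join pairs at distance <= eps, and triangles
    are triples whose three edges are all present. Betti numbers are computed
    over the reals from the ranks of the boundary matrices d1 and d2.
    """
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    adj = dist <= eps

    edges = [(i, j) for i, j in combinations(range(n), 2) if adj[i, j]]
    tris = [
        (i, j, k)
        for i, j, k in combinations(range(n), 3)
        if adj[i, j] and adj[i, k] and adj[j, k]
    ]
    edge_index = {e: idx for idx, e in enumerate(edges)}

    # d1 maps edges to vertices: boundary of (i, j) is j - i.
    d1 = np.zeros((n, len(edges)))
    for col, (i, j) in enumerate(edges):
        d1[i, col] = -1.0
        d1[j, col] = 1.0

    # d2 maps triangles to edges: boundary of (i, j, k) is (j,k) - (i,k) + (i,j).
    d2 = np.zeros((len(edges), len(tris)))
    for col, (i, j, k) in enumerate(tris):
        d2[edge_index[(j, k)], col] = 1.0
        d2[edge_index[(i, k)], col] = -1.0
        d2[edge_index[(i, j)], col] = 1.0

    rank_d1 = int(np.linalg.matrix_rank(d1)) if edges else 0
    rank_d2 = int(np.linalg.matrix_rank(d2)) if tris else 0

    b0 = n - rank_d1                     # connected components
    b1 = len(edges) - rank_d1 - rank_d2  # independent loops
    return b0, b1


if __name__ == "__main__":
    cloud = noisy_circle()
    print(rips_betti(cloud, eps=0.5))
```

The same check could be run on a low-dimensional embedding of `cloud`: an embedding that tears the loop or adds extra ones would change `b1` away from 1. For realistic dataset sizes, a dedicated persistent homology library would be the more appropriate tool than this dense-matrix sketch.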