Conditional t-SNE (ct-SNE) is a recent extension of t-SNE that removes known cluster information from the embedding, yielding a visualization that reveals structure beyond the label information. This is useful, for example, when one wants to factor out unwanted differences between a set of classes. We show that ct-SNE fails in many realistic settings, namely when the data is well clustered by label in the original high-dimensional space. We introduce a revised method that conditions the high-dimensional similarities instead of the low-dimensional similarities and stores within- and across-label nearest neighbors separately. This also enables the use of recently proposed speedups for t-SNE, improving scalability. In experiments on synthetic data, we find that the proposed method resolves the problems considered and improves embedding quality. On real data containing batch effects, the expected improvement does not always materialize. We argue that revised ct-SNE is nonetheless preferable overall, given its improved scalability. The results also highlight new open questions, such as how to handle variations in distance between clusters.
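The abstract gives no formulas, so the following is only a rough sketch of what "conditioning the high-dimensional similarities" with separately stored within- and across-label neighbors might look like. It is not the authors' implementation. The function names, the fixed Gaussian bandwidth `sigma` (standing in for perplexity calibration), the down-weighting factor `same_weight`, and the brute-force neighbor search are all assumptions made for illustration.

```python
import numpy as np


def split_neighbors(X, labels, k):
    """Return squared distances plus, per point, its k nearest
    same-label and k nearest other-label neighbors (brute force)."""
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)  # a point is never its own neighbor
    same = labels[:, None] == labels[None, :]
    # Mask out the pairs that do not belong to each neighbor list.
    within = np.argsort(np.where(same, d2, np.inf), axis=1)[:, :k]
    across = np.argsort(np.where(~same, d2, np.inf), axis=1)[:, :k]
    return d2, within, across


def conditioned_affinities(X, labels, k=5, sigma=1.0, same_weight=0.1):
    """Hypothetical label-conditioned affinity matrix.

    Same-label pairs are down-weighted by `same_weight` (an assumed
    parameter) so that label structure contributes less to the embedding.
    """
    n = X.shape[0]
    d2, within, across = split_neighbors(X, labels, k)
    rows = np.arange(n)[:, None]
    P = np.zeros((n, n))
    P[rows, within] = same_weight * np.exp(-d2[rows, within] / (2 * sigma ** 2))
    P[rows, across] = np.exp(-d2[rows, across] / (2 * sigma ** 2))
    P /= P.sum(axis=1, keepdims=True)  # row-normalize, as in t-SNE
    return (P + P.T) / (2 * n)         # symmetrize, as in t-SNE
```

Because each point keeps only `2 * k` nonzero affinities, a matrix built this way is sparse, which is the property that neighbor-based t-SNE accelerations rely on. Whether the revised ct-SNE weights the two neighbor sets in this manner is not stated in the abstract.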