Quantifying the degree of fusion and separability between data groups in representation space is a fundamental problem in representation learning, particularly under domain shift. A meaningful metric should capture fusion-altering factors, such as geometric displacement between representation groups, whose variations change the extent of fusion, while remaining invariant to fusion-preserving factors, such as global scaling and sampling-induced layout changes, whose variations do not. Existing distributional distance metrics conflate these factors, yielding measures that are uninformative about the true extent of fusion between data groups. We introduce Cross-Fusion Distance (CFD), a principled measure that isolates fusion-altering geometry while remaining robust to fusion-preserving variations, and that has linear computational complexity. We characterize the invariance and sensitivity properties of CFD theoretically and validate them in controlled synthetic experiments. On real-world datasets with domain shift, CFD aligns more closely with downstream generalization degradation than commonly used alternatives. Overall, CFD provides a theoretically grounded and interpretable distance measure for representation learning.