Measuring dataset similarity is fundamental in machine learning, particularly for transfer learning and domain adaptation. In the context of supervised learning, most existing approaches quantify the similarity of two datasets based on their input feature distributions alone, neglecting label information and feature-response alignment. To address this, we propose the Cross-Learning Score (CLS), which measures dataset similarity through the bidirectional generalization performance of decision rules. We establish its theoretical foundation by linking CLS to the cosine similarity between decision boundaries under canonical linear models, providing a geometric interpretation. We develop a robust ensemble-based estimator that is easy to implement and bypasses high-dimensional density estimation entirely. For transfer learning applications, we introduce a "transferable zones" framework that categorizes source datasets into positive, ambiguous, and negative transfer regions. To accommodate deep learning, we extend CLS to encoder-head architectures, aligning with modern representation-based pipelines. Extensive experiments on synthetic and real-world datasets validate the effectiveness of CLS for similarity measurement and transfer assessment.
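The core idea of a cross-learning score can be sketched as follows: fit a classifier on each dataset, evaluate it on the other, and combine the two cross-generalization accuracies. This is a minimal illustration of the bidirectional principle only; the symmetrized average, the logistic-regression learner, and the synthetic data below are assumptions for the sketch, not the paper's ensemble-based CLS estimator.

```python
# Minimal sketch of a cross-learning style similarity score.
# Assumptions: logistic regression as the decision rule and a simple
# average of the two cross accuracies (illustrative, not the paper's CLS).
import numpy as np
from sklearn.linear_model import LogisticRegression

def cross_learning_score(Xa, ya, Xb, yb):
    """Average bidirectional generalization accuracy between two labeled datasets."""
    clf_a = LogisticRegression().fit(Xa, ya)  # rule learned on dataset A
    clf_b = LogisticRegression().fit(Xb, yb)  # rule learned on dataset B
    acc_ab = clf_a.score(Xb, yb)              # A -> B generalization
    acc_ba = clf_b.score(Xa, ya)              # B -> A generalization
    return 0.5 * (acc_ab + acc_ba)

rng = np.random.default_rng(0)
# Two datasets drawn from the same linearly separable mechanism:
Xa = rng.normal(size=(200, 2)); ya = (Xa[:, 0] + Xa[:, 1] > 0).astype(int)
Xb = rng.normal(size=(200, 2)); yb = (Xb[:, 0] + Xb[:, 1] > 0).astype(int)
print(cross_learning_score(Xa, ya, Xb, yb))  # close to 1 for similar datasets
```

When the two datasets share the same feature-response relationship, both cross accuracies are high and the score approaches 1; flipping the labels of one dataset drives both directions down, which is the label-aware behavior that purely feature-based similarity measures miss.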