In many real-world computer vision applications, including medical imaging and industrial inspection, binary classification tasks are characterized by a severe scarcity of positive samples. A widely adopted solution is to generate synthetic positive data using image-to-image transformations applied to negative samples. However, a fundamental challenge remains: how can we reliably assess whether such synthetic data will improve downstream model performance? In this work, we propose a geometry-driven metric that predicts the utility of synthetic data without requiring model training. Our approach operates in the embedding space of a pre-trained foundation model and represents the dataset through difference vectors between samples. We evaluate whether the weight vector of a linear classifier can be expressed within the subspace spanned by these variations by measuring the relative projection error. Intuitively, if the variations induced by synthetic data capture task-relevant directions, their span can approximate the classifier, resulting in low projection error. Conversely, poor synthetic data fails to span these directions, leading to higher error. Across multiple datasets and architectures, we show that this metric exhibits strong correlation with downstream classification performance of CNNs trained on mixtures of real negative and synthetic positive data. These findings suggest that the proposed metric serves as a practical and informative tool for evaluating synthetic data quality in data-scarce settings.
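The relative projection error described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how difference vectors are paired or how the linear classifier is obtained. Here each synthetic positive embedding is assumed to be paired with the negative embedding it was generated from, and the weight vector `w` is assumed to be given. The function names `difference_vectors` and `relative_projection_error` are hypothetical.

```python
import numpy as np


def difference_vectors(neg_emb, syn_emb):
    """Row-wise differences between paired embeddings.

    Assumption: row i of syn_emb is the embedding of the synthetic positive
    generated from the negative sample in row i of neg_emb.
    """
    neg_emb = np.asarray(neg_emb, dtype=float)
    syn_emb = np.asarray(syn_emb, dtype=float)
    if neg_emb.shape != syn_emb.shape:
        raise ValueError("paired embeddings must have the same shape")
    return syn_emb - neg_emb


def relative_projection_error(diffs, w):
    """Return ||w - P w|| / ||w||, where P projects onto span(rows of diffs).

    diffs: (n, d) array of difference vectors.
    w:     (d,) weight vector of a linear classifier (assumed given).
    """
    diffs = np.asarray(diffs, dtype=float)
    w = np.asarray(w, dtype=float)
    w_norm = np.linalg.norm(w)
    if w_norm == 0.0:
        raise ValueError("w must be non-zero")
    # Least-squares coefficients c minimizing ||diffs.T @ c - w||;
    # diffs.T @ c is then the orthogonal projection of w onto the span.
    coeffs, *_ = np.linalg.lstsq(diffs.T, w, rcond=None)
    w_proj = diffs.T @ coeffs
    return float(np.linalg.norm(w - w_proj) / w_norm)


# Toy example in 3 dimensions: the differences span only the x-y plane.
neg = np.zeros((2, 3))
syn = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0]])
D = difference_vectors(neg, syn)

in_span = relative_projection_error(D, np.array([1.0, 1.0, 0.0]))
out_of_span = relative_projection_error(D, np.array([0.0, 0.0, 1.0]))
print(in_span, out_of_span)
```

In the toy example, a weight vector lying in the plane spanned by the differences gives an error near 0, while one orthogonal to that plane gives an error near 1, matching the intuition that good synthetic variations should span the task-relevant directions.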