This paper introduces a task-specific, model-agnostic framework for evaluating dataset similarity, providing a means to assess and compare the realism and quality of datasets. Such a framework is crucial for augmenting real-world data, improving benchmarking, and making informed retraining decisions when adapting to new deployment settings, such as different sites or frequency bands. The framework is used to design metrics based on UMAP, a topology-preserving dimensionality reduction method, which apply Wasserstein and Euclidean distances to KNN clusters in the latent space. On an unsupervised channel state information (CSI) compression task that uses autoencoder architectures, the designed metrics show correlations above 0.85 between dataset distance and model performance. The results also show that the designed metrics outperform traditional methods.
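To make the idea of a distance between datasets in a latent space concrete, the following is a minimal sketch under stated assumptions and not the metric designed in the paper. It assumes the latent points have already been produced by some dimensionality reduction step (for example UMAP, which is not run here), omits the KNN clustering step entirely, and replaces the paper's distance with a simplified average of per-coordinate one-dimensional Wasserstein distances. The function names `wasserstein_1d`, `latent_distance`, and `pearson` are hypothetical and chosen only for illustration.

```python
import math


def wasserstein_1d(a, b):
    """1-D Wasserstein-1 distance between two equal-size samples.

    For equal-size empirical distributions this equals the mean absolute
    gap between the sorted values.
    """
    if len(a) != len(b) or not a:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def latent_distance(points_a, points_b):
    """Average per-coordinate W1 distance between two sets of latent points.

    Assumption: this axis-aligned average is a simplification for
    illustration, not the metric from the paper.
    """
    dims = len(points_a[0])
    total = 0.0
    for d in range(dims):
        coords_a = [p[d] for p in points_a]
        coords_b = [q[d] for q in points_b]
        total += wasserstein_1d(coords_a, coords_b)
    return total / dims


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

In such a setup, one would compute `latent_distance` between a reference dataset and each candidate dataset, then use `pearson` to check how strongly those distances track a model performance measure on the same candidates. Any values fed into this sketch would be the user's own and are not results from the paper.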