Our work introduces SAVeD (Semantically Aware Version Detection), a contrastive-learning framework for identifying versions of structured datasets without relying on metadata, labels, or integration-based assumptions. SAVeD addresses a common source of repeated labor in data science: the difficulty of discovering prior similar work or transformations on a dataset. SAVeD employs a modified SimCLR pipeline, generating augmented views of a table through random transformations (e.g., row deletion, encoding perturbations). These views are embedded by a custom transformer encoder and contrasted in latent space to optimize semantic similarity: the model learns to minimize distances between augmented views of the same dataset and maximize those between unrelated tables. We evaluate performance using validation accuracy and separation, defined respectively as the proportion of correctly classified version/non-version pairs on a held-out set, and the difference between the average similarities of versioned and non-versioned table pairs (version labels come from a benchmark and are never provided to the model). Our experiments span five canonical datasets from the Semantic Versioning in Databases Benchmark and demonstrate substantial gains after training: SAVeD achieves significantly higher accuracy on completely unseen tables and a significant boost in separation scores, confirming its capability to distinguish semantically altered versions. Compared to untrained baselines and prior state-of-the-art dataset-discovery methods such as Starmie, our custom encoder achieves competitive or superior results.
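The contrastive objective described above can be sketched in a few lines. This is a minimal illustrative NumPy example, not the paper's implementation: the row-deletion augmentation and the mean-pooling `embed` function are hypothetical stand-ins for SAVeD's random table transformations and custom transformer encoder, while the loss follows the standard SimCLR NT-Xent formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(table, drop_frac=0.2):
    # Hypothetical row-deletion augmentation: keep a random subset of rows.
    n = table.shape[0]
    keep = rng.choice(n, size=max(1, int(n * (1 - drop_frac))), replace=False)
    return table[np.sort(keep)]

def embed(table):
    # Stand-in for the transformer encoder: mean-pool rows, L2-normalize.
    v = table.mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-8)

def nt_xent(z1, z2, temperature=0.5):
    # NT-Xent (SimCLR) loss over a batch of N positive pairs.
    z = np.concatenate([z1, z2])                # 2N x d, rows unit-norm
    sim = z @ z.T / temperature                 # cosine similarities
    np.fill_diagonal(sim, -np.inf)              # exclude self-similarity
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # positive index
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

# Each toy "dataset" yields two augmented views forming a positive pair;
# views of the other datasets in the batch serve as negatives.
tables = [rng.normal(size=(50, 8)) for _ in range(4)]
z1 = np.stack([embed(augment(t)) for t in tables])
z2 = np.stack([embed(augment(t)) for t in tables])
loss = nt_xent(z1, z2)
```

Minimizing this loss pulls the two views of each table together in latent space while pushing apart embeddings of unrelated tables, which is the behavior the separation metric then measures at evaluation time.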