One major challenge in machine learning applications is coping with mismatches between the datasets used during development and those encountered in real-world deployment. Such mismatches can cause inaccurate predictions and errors, resulting in poor product quality and unreliable systems. In this study, we propose StyleDiff, which informs developers of the differences between two datasets to support the steady development of machine learning systems. Using disentangled image spaces obtained from recently proposed generative models, StyleDiff compares the two datasets by focusing on attributes in the images and provides an easy-to-understand analysis of how the datasets differ. StyleDiff runs in $O(dN\log N)$ time, where $N$ is the size of the datasets and $d$ is the number of attributes, which makes it applicable to large datasets. We demonstrate, using driving-scene datasets among others, that StyleDiff accurately detects differences between datasets and presents them in an understandable format.
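To give a feel for where an $O(dN\log N)$ cost can come from, the sketch below compares two datasets one attribute at a time by sorting. It is an illustration under stated assumptions, not the StyleDiff implementation: it assumes each image has already been encoded into $d$ scalar attribute values, that both datasets have the same size $N$, and that the per-attribute gap is measured with the one-dimensional Wasserstein-1 distance. The names `attribute_gaps` and `rank_attributes` are hypothetical.

```python
def attribute_gaps(dev, prod):
    """Per-attribute gap between two equally sized datasets.

    dev, prod: lists of N rows, each row a list of d attribute values.
    Assumption: the gap is the 1-D Wasserstein-1 distance, which for
    equal-size samples is the mean absolute difference of sorted values.
    """
    if len(dev) != len(prod):
        raise ValueError("this sketch assumes datasets of equal size")
    n, d = len(dev), len(dev[0])
    gaps = []
    for j in range(d):
        # Sorting one attribute costs O(N log N); d attributes give O(d N log N).
        a = sorted(row[j] for row in dev)
        b = sorted(row[j] for row in prod)
        gaps.append(sum(abs(x - y) for x, y in zip(a, b)) / n)
    return gaps


def rank_attributes(dev, prod, top_k=3):
    """Return (attribute index, gap) pairs, largest gap first."""
    gaps = attribute_gaps(dev, prod)
    order = sorted(range(len(gaps)), key=lambda j: gaps[j], reverse=True)
    return [(j, gaps[j]) for j in order[:top_k]]
```

Ranking attributes by gap is one simple way to surface the few attributes that differ most. Which distance measure and which attribute space to use are design choices that this sketch does not settle.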