StyleDiff: Attribute Comparison Between Unlabeled Datasets in Latent Disentangled Space

One major challenge in machine learning applications is coping with mismatches between the datasets used in development and those obtained in real-world applications. These mismatches may lead to inaccurate predictions and errors, resulting in poor product quality and unreliable systems. In this study, we propose StyleDiff, which informs developers of the differences between two datasets to support the steady development of machine learning systems. Using disentangled image spaces obtained from recently proposed generative models, StyleDiff compares the two datasets by focusing on attributes in the images and provides an easy-to-understand analysis of the differences between them. StyleDiff runs in $O(dN\log N)$ time, where $N$ is the size of the datasets and $d$ is the number of attributes, which makes it applicable to large datasets. We demonstrate, using driving-scene datasets among others, that StyleDiff accurately detects differences between datasets and presents them in an understandable format.
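The abstract does not spell out how the comparison is computed. As a rough illustration only, and not the authors' algorithm, the sketch below shows one way an $O(dN\log N)$ cost can arise: if each of the $d$ attributes corresponds to one coordinate of a disentangled latent code, the two datasets can be compared attribute by attribute with a sort-based statistic. The function names `attribute_gaps` and `rank_attributes`, the choice of the 1-D Wasserstein-1 distance, and the requirement that both datasets have the same size $N$ are all assumptions made for this example.

```python
import numpy as np


def attribute_gaps(latents_a, latents_b):
    """Per-attribute 1-D Wasserstein-1 distance between two sets of latent codes.

    Both inputs are (N, d) arrays: N samples, d attribute coordinates.
    Hypothetical helper for illustration; not taken from the StyleDiff paper.
    """
    a = np.asarray(latents_a, dtype=float)
    b = np.asarray(latents_b, dtype=float)
    if a.ndim != 2 or a.shape != b.shape:
        raise ValueError("expected two (N, d) arrays of identical shape")
    # Sorting each of the d columns costs O(N log N), so O(d N log N) overall.
    a_sorted = np.sort(a, axis=0)
    b_sorted = np.sort(b, axis=0)
    # For equal-sized samples, the 1-D Wasserstein-1 distance is the mean
    # absolute difference between the sorted values.
    return np.abs(a_sorted - b_sorted).mean(axis=0)


def rank_attributes(latents_a, latents_b):
    """Return (attribute_index, gap) pairs, largest gap first."""
    gaps = attribute_gaps(latents_a, latents_b)
    order = np.argsort(gaps)[::-1]
    return [(int(i), float(gaps[i])) for i in order]


if __name__ == "__main__":
    # Synthetic stand-in for latent codes: only attribute 2 is shifted.
    rng = np.random.default_rng(0)
    dev = rng.normal(size=(1000, 4))
    field = rng.normal(size=(1000, 4))
    field[:, 2] += 3.0
    for index, gap in rank_attributes(dev, field):
        print(f"attribute {index}: gap {gap:.3f}")
```

In this synthetic example, the shifted attribute should come out with the largest gap, which is the kind of per-attribute report the abstract describes. Obtaining the latent codes from a generative model is outside the scope of this sketch.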


