In conventional supervised learning, a training dataset comes with ground-truth labels from a known label set, and the learned model classifies unseen instances into those known labels. This paper studies a new problem setting in which the training data contain unknown classes that have been misperceived as other labels, so their existence is not apparent from the given supervision. We attribute these unknown unknowns to the training dataset being misled by an incompletely perceived label space, which in turn results from insufficient feature information. To this end, we propose exploratory machine learning, which examines and investigates the training data by actively augmenting the feature space to discover potentially hidden classes. Our method consists of three ingredients: a rejection model, feature exploration, and a model cascade. We provide theoretical analysis to justify its superiority and validate its effectiveness on both synthetic and real datasets.
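To make the three ingredients named in the abstract concrete, the following is a minimal toy sketch, not the paper's algorithm. It assumes a nearest-centroid classifier, a margin-based rejection rule, a distance-based "unknown" rule, and a simulated feature oracle; the names `stage1`, `stage2`, `cascade`, `acquire`, `MARGIN`, and `RADIUS` are all hypothetical and chosen only for illustration.

```python
import math

MARGIN = 0.5   # assumed rejection threshold for the first-stage model
RADIUS = 1.5   # assumed distance beyond which an instance is flagged "unknown"

def fit_centroids(X, y):
    """Mean feature vector per label."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}

def ranked(centroids, x):
    """(distance, label) pairs, closest centroid first."""
    return sorted((math.dist(c, x), label) for label, c in centroids.items())

def stage1(centroids, x, margin):
    """Rejection model: return None when the two closest classes are nearly tied."""
    r = ranked(centroids, x)
    if len(r) > 1 and r[1][0] - r[0][0] < margin:
        return None
    return r[0][1]

def stage2(centroids, x_aug, radius):
    """Augmented model: 'unknown' when x_aug is far from every known class."""
    dist, label = ranked(centroids, x_aug)[0]
    return label if dist <= radius else "unknown"

def cascade(x, acquire, c1, c2, margin, radius):
    """Model cascade: keep the stage-1 label if confident, else explore features."""
    label = stage1(c1, x, margin)
    if label is not None:
        return label
    return stage2(c2, x + acquire(x), radius)

# Toy data: one observed feature. The instances [0.9] and [1.1] come from a
# hidden class but carry the known labels "a" and "b" in the supervision.
X = [[0.0], [0.2], [0.9], [2.0], [1.8], [1.1]]
y = ["a", "a", "a", "b", "b", "b"]

def acquire(x):
    """Simulated feature exploration: an extra feature that exposes the hidden class."""
    return [5.0] if 0.85 <= x[0] <= 1.15 else [0.0]

c1 = fit_centroids(X, y)
# Assumption: the second stage is fit only on instances the first stage accepts.
confident = [(x, l) for x, l in zip(X, y) if stage1(c1, x, MARGIN) is not None]
c2 = fit_centroids([x + acquire(x) for x, _ in confident],
                   [l for _, l in confident])

for query in ([0.1], [1.0], [1.2]):
    print(query, cascade(query, acquire, c1, c2, MARGIN, RADIUS))
```

In this toy setup, an instance is only sent to the augmented feature space when the first-stage model cannot separate the known classes with enough margin, which mirrors the idea of spending feature-acquisition effort where the given supervision looks suspicious. The thresholds and the oracle are placeholders; a real system would need a principled rejection rule and a budgeted way to choose which features to acquire.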