Automated machine learning (AutoML) aims to select and configure machine learning algorithms and to combine them into machine learning pipelines tailored to the dataset at hand. For supervised learning tasks, most notably binary and multinomial classification, also known as single-label classification (SLC), such AutoML approaches have shown promising results. However, the task of multi-label classification (MLC), where data points are associated with a set of class labels instead of a single class label, has received much less attention so far. In the context of multi-label classification, the data-specific selection and configuration of multi-label classifiers is challenging even for experts in the field, as it is a high-dimensional optimization problem with multi-level hierarchical dependencies. While the space of machine learning pipelines is already huge for SLC, the MLC search space is larger than that of SLC by several orders of magnitude. In the first part of this thesis, we devise a novel AutoML approach for single-label classification tasks that optimizes pipelines consisting of at most two machine learning algorithms. This approach is then extended, first to optimize pipelines of unlimited length and eventually to configure the complex hierarchical structures of multi-label classification methods. Furthermore, we investigate how well AutoML approaches that form the state of the art for single-label classification tasks scale with the increased problem complexity of AutoML for multi-label classification. In the second part, we explore how methods for SLC and MLC could be configured more flexibly to achieve better generalization performance and how to increase the efficiency of execution-based AutoML systems.