The Problem-oriented AutoML in Clustering (PoAC) framework is a flexible approach to automating clustering tasks that addresses shortcomings of traditional AutoML solutions. Conventional methods often rely on predefined internal Clustering Validity Indexes (CVIs) and static meta-features, which limits their adaptability and effectiveness across diverse clustering tasks. In contrast, PoAC establishes a dynamic connection between the clustering problem, the CVIs, and the meta-features, allowing users to customize these components to the specific context and goals of their task. At its core, PoAC employs a surrogate model trained on a large meta-knowledge base of previous clustering datasets and solutions, which enables it to infer the quality of new clustering pipelines and to synthesize optimal solutions for unseen datasets. Unlike many AutoML frameworks that are constrained by fixed evaluation metrics and algorithm sets, PoAC is algorithm-agnostic and adapts to different clustering problems without requiring additional data or retraining. Experimental results show that PoAC outperforms state-of-the-art frameworks on a variety of datasets and excels in specific tasks such as data visualization; they also highlight its ability to adjust pipeline configurations dynamically based on dataset complexity.
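The surrogate-model idea described in the abstract can be illustrated with a minimal sketch. It is not PoAC's implementation: the abstract does not say which learner is used, so a k-nearest-neighbour average stands in for the surrogate, and the names `MetaRecord`, `predict_score`, `recommend`, and `KNOWLEDGE_BASE`, the numeric pipeline encoding, and all values are hypothetical and chosen only for illustration. Each record pairs the meta-features of a past dataset with a pipeline and the quality score that pipeline obtained under whichever CVI the user selected.

```python
import math
from dataclasses import dataclass
from typing import Tuple


# Hypothetical record in a meta-knowledge base. The field names are
# illustrative assumptions, not PoAC's actual schema.
@dataclass(frozen=True)
class MetaRecord:
    meta_features: Tuple[float, ...]  # descriptors of a past dataset
    pipeline: Tuple[float, ...]       # numeric encoding of a clustering pipeline
    score: float                      # quality observed under the chosen CVI


def predict_score(records, meta_features, pipeline, k=3):
    """Estimate pipeline quality as the mean score of the k nearest records.

    A k-nearest-neighbour average is a stand-in for the surrogate model;
    the source does not specify the learner.
    """
    query = tuple(meta_features) + tuple(pipeline)
    nearest = sorted(
        records,
        key=lambda r: math.dist(query, r.meta_features + r.pipeline),
    )[:k]
    return sum(r.score for r in nearest) / len(nearest)


def recommend(records, meta_features, candidates, k=3):
    """Return the candidate pipeline with the highest predicted score."""
    return max(
        candidates,
        key=lambda p: predict_score(records, meta_features, p, k),
    )


# Toy meta-knowledge base with invented values.
# Assumed pipeline encoding: (algorithm id, number of clusters).
KNOWLEDGE_BASE = [
    MetaRecord((0.1, 0.2), (0.0, 2.0), 0.80),
    MetaRecord((0.1, 0.2), (1.0, 2.0), 0.40),
    MetaRecord((0.9, 0.8), (0.0, 5.0), 0.30),
    MetaRecord((0.9, 0.8), (1.0, 5.0), 0.70),
]

# Rank two candidate pipelines for an unseen dataset without running them.
best = recommend(
    KNOWLEDGE_BASE,
    meta_features=(0.15, 0.25),
    candidates=[(0.0, 2.0), (1.0, 2.0)],
    k=1,
)
print(best)
```

In this sketch, switching to a different CVI would only change the `score` values stored in the records; the candidate pipelines and the search over them stay the same, which is one way to read the customizable link between problem, CVI, and meta-features.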