Tabular data is the primary data format in industrial relational databases, underpinning modern data analytics and decision-making. However, the increasing scale of tabular data poses significant computational and storage challenges to learning-based analytical systems. This highlights the need for data-efficient learning, which enables effective model training and generalization using substantially fewer samples. Dataset condensation (DC) has emerged as a promising data-centric paradigm that synthesizes small yet informative datasets to preserve data utility while reducing storage and training costs. Existing DC methods, however, are computationally intensive because they rely on complex gradient-based optimization, and they often overlook key characteristics of tabular data, such as heterogeneous features and class imbalance. To address these limitations, we introduce C$^{2}$TC (Class-Adaptive Clustering for Tabular Condensation), the first training-free tabular dataset condensation framework that jointly optimizes class allocation and feature representation, enabling efficient and scalable condensation. Specifically, we reformulate the dataset condensation objective into a novel class-adaptive cluster allocation problem (CCAP), which eliminates costly training and integrates adaptive label allocation to handle class imbalance. To solve the NP-hard CCAP, we develop HFILS, a heuristic local search that alternates between soft allocation and class-wise clustering to efficiently obtain high-quality solutions. We also propose a hybrid categorical feature encoding (HCFE) for semantics-preserving clustering of heterogeneous discrete attributes. Extensive experiments on 10 real-world datasets demonstrate that C$^{2}$TC improves efficiency by at least two orders of magnitude over state-of-the-art baselines while achieving superior downstream performance.
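To make the idea of training-free, class-adaptive condensation concrete, the following is a minimal sketch and not the authors' HFILS or HCFE. It assumes purely numerical features, uses plain Lloyd's k-means within each class, and substitutes a simple greedy rule for the allocation step: each extra synthetic sample goes to the class whose within-class clustering error drops the most. The function names `kmeans` and `condense`, the `budget` parameter, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np


def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, sum of squared errors)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared distances between every point and every centroid: (n, k).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, float(d2.min(axis=1).sum())


def condense(X, y, budget):
    """Greedy class-adaptive allocation (an assumed stand-in, not HFILS).

    Every class starts with one centroid; each remaining unit of budget is
    given to the class whose clustering error decreases the most.
    """
    classes = np.unique(y)
    if budget < len(classes):
        raise ValueError("budget must cover at least one sample per class")
    alloc = {c: 1 for c in classes}
    fits = {c: kmeans(X[y == c], 1) for c in classes}
    for _ in range(budget - len(classes)):
        best_c, best_gain, best_fit = None, float("-inf"), None
        for c in classes:
            Xc = X[y == c]
            if alloc[c] >= len(Xc):  # cannot have more clusters than points
                continue
            cand = kmeans(Xc, alloc[c] + 1)
            gain = fits[c][1] - cand[1]
            if gain > best_gain:
                best_c, best_gain, best_fit = c, gain, cand
        if best_c is None:
            break
        alloc[best_c] += 1
        fits[best_c] = best_fit
    # The centroids of each class form the condensed dataset.
    X_syn = np.vstack([fits[c][0] for c in classes])
    y_syn = np.concatenate([np.full(alloc[c], c) for c in classes])
    return X_syn, y_syn


# Toy imbalanced data: a large, tight majority class and a small minority
# class made of two well-separated groups.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[50.0, 50.0], scale=0.1, size=(200, 2))
X1a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(10, 2))
X1b = rng.normal(loc=[10.0, 0.0], scale=0.1, size=(10, 2))
X = np.vstack([X0, X1a, X1b])
y = np.array([0] * 200 + [1] * 20)

X_syn, y_syn = condense(X, y, budget=3)
```

In this toy setting the spare sample goes to the minority class, because splitting its two groups reduces the clustering error far more than splitting the tight majority class. A size-proportional allocation would instead spend the extra sample on the majority class, which is the kind of imbalance issue the abstract describes.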