Tabular data poses unique challenges due to its heterogeneous nature, combining both continuous and categorical variables. Existing approaches often struggle to capture the underlying structure and relationships within such data. We propose GFTab (Geodesic Flow Kernels for Semi-Supervised Learning on Mixed-Variable Tabular Dataset), a semi-supervised framework designed specifically for tabular datasets. GFTab incorporates three key innovations: 1) variable-specific corruption methods tailored to the distinct properties of continuous and categorical variables, 2) a geodesic flow kernel-based similarity measure that captures geometric changes between corrupted inputs, and 3) a tree-based embedding that leverages hierarchical relationships from the available labeled data. To rigorously evaluate GFTab, we curate a comprehensive set of 21 tabular datasets spanning various domains, sizes, and variable compositions. Our experimental results show that GFTab outperforms existing machine learning and deep learning models on many of these datasets, particularly in settings with limited labeled data.