Learning from few labeled tabular samples is often an essential requirement in industrial machine learning applications, because many kinds of tabular data are costly to annotate or make it difficult to collect new samples for novel tasks. Despite its importance, this problem remains under-explored in tabular learning, and existing few-shot learning schemes from other domains are not straightforward to apply, mainly because of the heterogeneous characteristics of tabular data. In this paper, we propose a simple yet effective framework for few-shot semi-supervised tabular learning, coined Self-generated Tasks from UNlabeled Tables (STUNT). Our key idea is to self-generate diverse few-shot tasks by treating randomly chosen columns as a target label. We then employ a meta-learning scheme to learn generalizable knowledge from the constructed tasks. Moreover, we introduce an unsupervised validation scheme for hyperparameter search (and early stopping) that uses STUNT to generate a pseudo-validation set from unlabeled data. Our experimental results demonstrate that this simple framework yields significant performance gains on various tabular few-shot learning benchmarks compared to prior semi- and self-supervised baselines. Code is available at https://github.com/jaehyun513/STUNT.
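To make the key idea concrete, the following is a minimal illustrative sketch of building one few-shot task from an unlabeled table by treating a randomly chosen column as the target label. It is not the released implementation. The function name `make_pseudo_task`, its parameters, and in particular the equal-frequency binning used to turn the chosen column into classes are assumptions introduced here for illustration; the abstract does not specify how pseudo-labels are derived from the selected columns.

```python
import numpy as np


def make_pseudo_task(X, n_way=3, k_shot=2, n_query=5, rng=None):
    """Build one pseudo few-shot task from an unlabeled numeric table X.

    Hypothetical helper for illustration only; not the STUNT code base.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_rows, n_cols = X.shape
    if n_rows // n_way < k_shot + n_query:
        raise ValueError("not enough rows for the requested task size")

    # Pick one column at random to play the role of the target label.
    target_col = int(rng.integers(n_cols))

    # Assumption: discretize the column into n_way equal-frequency bins
    # by ranking its values, so every pseudo-class has enough members.
    ranks = np.argsort(np.argsort(X[:, target_col], kind="stable"), kind="stable")
    pseudo_y = ranks * n_way // n_rows  # integer labels in 0 .. n_way - 1

    # Hide the chosen column from the inputs so the label cannot be read off.
    features = np.delete(X, target_col, axis=1)

    # Sample k_shot support rows and n_query query rows per pseudo-class.
    support_idx, query_idx = [], []
    for c in range(n_way):
        members = rng.permutation(np.flatnonzero(pseudo_y == c))
        support_idx.append(members[:k_shot])
        query_idx.append(members[k_shot:k_shot + n_query])
    support_idx = np.concatenate(support_idx)
    query_idx = np.concatenate(query_idx)

    return {
        "target_col": target_col,
        "support_x": features[support_idx],
        "support_y": pseudo_y[support_idx],
        "query_x": features[query_idx],
        "query_y": pseudo_y[query_idx],
    }
```

Calling this helper repeatedly with fresh random draws would produce a stream of varied tasks that a meta-learner could train on, and the same routine could in principle be run on held-out unlabeled rows to form a pseudo-validation set.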