Self-supervised learning (SSL) has been incorporated into many state-of-the-art models across various domains, where it defines pretext tasks on unlabeled datasets to learn contextualized and robust representations. Recently, SSL has become a new trend in exploring representation learning for tabular data, which is more challenging because tabular data lacks explicit relations from which to learn descriptive representations. This survey systematically reviews and summarizes the recent progress and challenges of SSL for non-sequential tabular data (SSL4NS-TD). We first present a formal definition of NS-TD and clarify its relationship to related studies. We then categorize existing approaches into three groups: predictive learning, contrastive learning, and hybrid learning, and describe the motivations and strengths of representative methods within each group. Building on this, we present application issues of SSL4NS-TD, including automatic data engineering, cross-table transferability, and domain knowledge integration. In addition, we elaborate on existing benchmarks and datasets for NS-TD applications and discuss the performance of existing tabular models. Finally, we discuss the challenges of SSL4NS-TD and provide potential directions for future research. We expect our work to encourage more research on lowering the barrier to entry for SSL in the tabular domain and on improving the foundations for implicit tabular data.