Self-supervised learning (SSL) has been incorporated into many state-of-the-art models across various domains, where it defines pretext tasks on unlabeled datasets to learn contextualized and robust representations. Recently, SSL has become a new trend for exploring representation learning on tabular data, which is more challenging because tabular data lacks explicit relations from which to learn descriptive representations. This survey systematically reviews and summarizes the recent progress and challenges of SSL for non-sequential tabular data (SSL4NS-TD). We first present a formal definition of NS-TD and clarify its relationship to related studies. We then categorize existing approaches into three groups: predictive learning, contrastive learning, and hybrid learning, and describe the motivations and strengths of representative methods in each direction. Building on this, we present application issues of SSL4NS-TD, including automatic data engineering, cross-table transferability, and domain knowledge integration. In addition, we elaborate on existing benchmarks and datasets for NS-TD applications to discuss the performance of existing tabular models. Finally, we discuss the challenges of SSL4NS-TD and provide potential directions for future research. We expect our work to encourage more research on lowering the barrier to entry for SSL in the tabular domain and on improving the foundations for implicit tabular data.