Tabular data, comprising rows (samples) that share the same set of columns (attributes), is one of the most widely used data types across industries, including financial services, health care, research, retail, and logistics. Tables have become the natural way of storing data in both industry and academia, and the data stored in them serve as an essential source of information for decision making. As computational power and internet connectivity increase, the data stored by companies grow exponentially: not only do databases become vast and challenging to maintain and operate, but the number of database tasks also increases. This has given rise to a line of research that applies learning techniques to support database tasks on such large and complex tables. In this work, we split the quest of learning on tabular data into two phases: the Classical Learning Phase and the Modern Machine Learning Phase. The Classical Learning Phase consists of models such as SVMs, linear and logistic regression, and tree-based methods. These models are best suited to small tables, and the tasks they can address are limited to classification and regression. In contrast, the Modern Machine Learning Phase contains models that use deep learning to learn latent-space representations of table entities. The objective of this survey is to scrutinize the varied approaches practitioners use to learn representations of structured data and to compare their efficacy.