Tabular data is central to machine learning applications across many industries. However, traditional data processing methods do not use all the information available in tables; in particular, they ignore contextual information such as column header descriptions. In addition, pre-processing raw data into a tabular format can remain a labor-intensive bottleneck in model development. This work introduces TabText, a processing and feature extraction framework that extracts contextual information from tabular data structures. TabText addresses these processing difficulties by converting table content into language and applying pre-trained large language models (LLMs). We evaluate the framework on nine healthcare prediction tasks, including patient discharge, ICU admission, and mortality. We show that 1) TabText enables simple, high-performing machine learning baseline models with minimal data pre-processing, and 2) augmenting pre-processed tabular data with TabText representations improves the average and worst-case AUC of standard machine learning models by as much as 6%.
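As a rough illustration of what "converting the content into language" could look like, the sketch below turns one table row into a short passage using column header descriptions. It is not the authors' implementation: the function name `row_to_text`, the sentence template, the missing-value wording, and the example columns are all assumptions made for illustration. The resulting text would then be passed to a pre-trained language model to obtain a feature vector, a step that is omitted here.

```python
from typing import Any, Mapping


def row_to_text(
    row: Mapping[str, Any],
    descriptions: Mapping[str, str],
    missing: str = "is missing",
) -> str:
    """Serialize one table row into plain-language sentences.

    Hypothetical helper: each column becomes one sentence of the form
    "<description> is <value>." Columns without a description fall back
    to the raw column name, and None values are reported as missing.
    """
    sentences = []
    for column, value in row.items():
        # Use the column header description when one is available.
        label = descriptions.get(column, column)
        if value is None:
            sentences.append(f"{label} {missing}.")
        else:
            sentences.append(f"{label} is {value}.")
    return " ".join(sentences)


# Illustrative (made-up) row and column descriptions.
row = {"age": 67, "hr": 92, "lactate": None}
descriptions = {"age": "Patient age in years", "hr": "Heart rate"}

text = row_to_text(row, descriptions)
print(text)
```

Because the serialization works directly on column names and descriptions, a row with missing values or columns lacking descriptions still yields usable text, which is one way such an approach could reduce manual pre-processing.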