Tabular data prediction has been employed in medical applications such as patient health risk prediction. However, existing methods usually revolve around algorithm design while overlooking the importance of data engineering. Medical tabular datasets frequently exhibit significant heterogeneity across sources, with limited sample sizes per source. As a result, previous predictors are often trained on manually curated small datasets and struggle to generalize across different tabular datasets during inference. This paper proposes scaling medical tabular data predictors (MediTab) to diverse tabular inputs with varying features. The method uses a data engine that leverages large language models (LLMs) to consolidate tabular samples, overcoming the barrier between tables with distinct schemas. It also aligns out-of-domain data with the target task through a "learn, annotate, and refinement" pipeline. The expanded training data enables the pre-trained MediTab to infer on arbitrary tabular inputs in the domain without fine-tuning, yielding significant improvements over supervised baselines: it reaches average rankings of 1.57 and 1.00 on 7 patient outcome prediction datasets and 3 trial outcome prediction datasets, respectively. In addition, MediTab exhibits strong zero-shot performance, outperforming supervised XGBoost models by 8.9% and 17.2% on average on the two prediction tasks, respectively.
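The consolidation step described above maps rows from tables with distinct schemas into a shared representation. A minimal sketch of the common first stage, serializing each tabular sample into a natural-language string that an LLM could then paraphrase or clean, might look as follows (the function name and example records are illustrative assumptions, not the paper's implementation):

```python
def serialize_row(row: dict) -> str:
    """Turn one tabular sample into a 'column is value' sentence,
    dropping missing fields. Hypothetical helper for illustration."""
    parts = [f"{col} is {val}" for col, val in row.items() if val is not None]
    return "; ".join(parts) + "."

# Two sources with different schemas map into the same text space,
# sidestepping the need for a shared column layout.
patient_a = {"age": 63, "smoker": "yes", "systolic_bp": 148}
patient_b = {"Age (years)": 57, "BMI": 31.2, "diagnosis": "type 2 diabetes"}

print(serialize_row(patient_a))  # age is 63; smoker is yes; systolic_bp is 148.
print(serialize_row(patient_b))
```

Once every source is expressed as text, a single text-based predictor can be trained across all of them, which is what makes the cross-table scaling feasible.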