In this paper, we demonstrate a surprising capability of large language models (LLMs): given only input feature names and a description of a prediction task, they can select the most predictive features, with performance rivaling the standard tools of data science. These models exhibit this capability across various query mechanisms. For example, we zero-shot prompt an LLM to output a numerical importance score for a feature (e.g., "blood pressure") in predicting an outcome of interest (e.g., "heart failure"), with no additional context. In particular, we find that the latest models, such as GPT-4, consistently identify the most predictive features regardless of the query mechanism or prompting strategy. We illustrate these findings through extensive experiments on real-world data, where LLM-based feature selection consistently achieves performance competitive with data-driven methods such as the LASSO, despite never having seen the downstream training data. Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place. This could benefit practitioners in domains like healthcare, where collecting high-quality data is costly.
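The score-based query mechanism described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the 0-to-1 score range, and the names `build_prompt`, `parse_score`, `select_features`, and `stub_llm` are all hypothetical. The LLM is passed in as a plain callable so that the sketch runs without any API, and the canned replies in `stub_llm` are placeholders rather than real model outputs.

```python
import re
from typing import Callable, List


def build_prompt(feature: str, outcome: str) -> str:
    # Hypothetical zero-shot prompt: only the feature name and the task
    # description are given, with no training data and no extra context.
    return (
        f'On a scale from 0 to 1, how important is the feature "{feature}" '
        f'for predicting "{outcome}"? Answer with a single number.'
    )


def parse_score(reply: str) -> float:
    # Take the first number in the reply and clamp it to [0, 1].
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"no numeric score found in reply: {reply!r}")
    return min(1.0, max(0.0, float(match.group())))


def select_features(
    features: List[str],
    outcome: str,
    ask_llm: Callable[[str], str],
    k: int,
) -> List[str]:
    # Query the model once per feature, then keep the k highest-scoring ones.
    scores = {f: parse_score(ask_llm(build_prompt(f, outcome))) for f in features}
    ranked = sorted(features, key=lambda f: scores[f], reverse=True)
    return ranked[:k]


def stub_llm(prompt: str) -> str:
    # Stand-in for a real model call. The numbers are illustrative
    # placeholders, not measured importances or results from the paper.
    canned = {"blood pressure": "0.9", "age": "0.7", "favorite color": "0.1"}
    for name, score in canned.items():
        if f'"{name}"' in prompt:
            return f"Importance score: {score}"
    return "0.0"


if __name__ == "__main__":
    candidates = ["favorite color", "blood pressure", "age"]
    print(select_features(candidates, "heart failure", stub_llm, k=2))
```

In practice, `stub_llm` would be replaced by a call to an actual model, and the selected features would then be used to train a downstream predictor for comparison against a data-driven baseline such as the LASSO.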