LLM-Select: Feature Selection with Large Language Models

In this paper, we demonstrate a surprising capability of large language models (LLMs): given only input feature names and a description of a prediction task, they are capable of selecting the most predictive features, with performance rivaling the standard tools of data science. Remarkably, these models exhibit this capacity across various query mechanisms. For example, we zero-shot prompt an LLM to output a numerical importance score for a feature (e.g., "blood pressure") in predicting an outcome of interest (e.g., "heart failure"), with no additional context. In particular, we find that the latest models, such as GPT-4, can consistently identify the most predictive features regardless of the query mechanism and across various prompting strategies. We illustrate these findings through extensive experiments on real-world data, where we show that LLM-based feature selection consistently achieves strong performance competitive with data-driven methods such as the LASSO, despite never having looked at the downstream training data. Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place. This could potentially benefit practitioners in domains like healthcare, where collecting high-quality data comes at a high cost.
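The score-based query described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the 0-to-1 score scale, and the names `build_prompt`, `parse_score`, `select_features`, and `fake_llm` are all assumptions made for the example. The language model is passed in as a plain callable, so no particular API is implied; here a canned stub stands in for a real model.

```python
import re
from typing import Callable, Dict, List


def build_prompt(feature: str, outcome: str) -> str:
    """Zero-shot prompt asking for one importance score (hypothetical wording)."""
    return (
        f'On a scale from 0 to 1, how important is the feature "{feature}" '
        f'for predicting "{outcome}"? Answer with a single number.'
    )


def parse_score(reply: str) -> float:
    """Take the first number found in the reply and clamp it to [0, 1]."""
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        raise ValueError(f"no numeric score in reply: {reply!r}")
    return min(1.0, max(0.0, float(match.group())))


def select_features(
    features: List[str],
    outcome: str,
    ask_llm: Callable[[str], str],
    k: int,
) -> List[str]:
    """Score every feature by name only, then keep the k highest-scoring ones."""
    scores: Dict[str, float] = {
        f: parse_score(ask_llm(build_prompt(f, outcome))) for f in features
    }
    ranked = sorted(features, key=lambda f: scores[f], reverse=True)
    return ranked[:k]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: returns canned replies (assumed values)."""
    canned = {
        "blood pressure": "0.9",
        "age": "Score: 0.7",
        "favorite color": "0.05",
    }
    for name, reply in canned.items():
        if f'"{name}"' in prompt:
            return reply
    return "0"


selected = select_features(
    ["favorite color", "age", "blood pressure"], "heart failure", fake_llm, k=2
)
print(selected)
```

Note that the selection step never touches the training data: only feature names and the task description reach the model, which is the property the abstract emphasizes.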


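Feature selection

Feature selection, also called feature subset selection (FSS) or attribute selection, means choosing N features from an existing set of M features so that a specified metric of the system is optimized. It picks the most effective features from the original set in order to reduce the dimensionality of the dataset. It is an important way to improve the performance of learning algorithms and a key data preprocessing step in pattern recognition. For a learning algorithm, good training samples are essential to training the model.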
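As a rough illustration of choosing N of M features from data, the sketch below ranks features by the absolute Pearson correlation with the target and keeps the top N. This simple filter criterion is an assumption chosen for brevity; it is not the LASSO and not a method taken from the paper, and the function names and toy data are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_top_n(
    columns: Dict[str, Sequence[float]], target: Sequence[float], n: int
) -> List[str]:
    """Keep the n feature names whose columns correlate most strongly with target."""
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], target)),
        reverse=True,
    )
    return ranked[:n]


# Toy data: M = 3 candidate features, N = 2 to keep.
columns = {
    "a": [1, 2, 3, 4],  # perfectly correlated with the target
    "b": [1, 1, 1, 1],  # constant, carries no information
    "c": [4, 1, 3, 2],  # weakly (negatively) correlated
}
target = [2, 4, 6, 8]
chosen = select_top_n(columns, target, n=2)
print(chosen)
```

Unlike the name-only LLM approach, this kind of selection needs the feature values and the target for every sample before it can rank anything.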
