Feature extraction from unstructured text is a critical step in many downstream classification pipelines, yet current approaches largely rely on hand-crafted prompts or fixed feature schemas. We formulate feature discovery as a dataset-level prompt optimization problem: given a labelled text corpus, the goal is to induce a global set of interpretable and discriminative feature definitions whose realizations optimize a downstream supervised learning objective. To this end, we propose a multi-agent prompt optimization framework in which language-model agents jointly propose feature definitions, extract feature values, and evaluate feature quality using dataset-level performance and interpretability feedback. Instruction prompts are iteratively refined based on this structured feedback, enabling optimization over prompts that induce shared feature sets rather than per-example predictions. This formulation departs from prior prompt optimization methods that rely on per-sample supervision and provides a principled mechanism for automatic feature discovery from unstructured text.
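The optimization loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the proposer and extractor agents are stubbed with toy keyword heuristics (real agents would be language-model calls), the downstream objective is a trivial voting rule standing in for a trained classifier, and all function and feature names (`propose_features`, `mentions_refund`, etc.) are hypothetical.

```python
# Hedged sketch of the dataset-level prompt-optimization loop: a proposer
# agent induces a shared feature schema, an extractor agent realizes feature
# values per document, and dataset-level feedback drives the next round.
from dataclasses import dataclass

@dataclass
class FeatureDef:
    name: str
    instruction: str  # natural-language extraction prompt for this feature

def propose_features(corpus, feedback):
    # Stub proposer agent (assumption): in the framework this would be an LLM
    # refining feature definitions from structured feedback.
    return [FeatureDef("mentions_refund", "Does the text mention a refund?"),
            FeatureDef("exclaims", "Does the text contain an exclamation?")]

def extract_value(feature, text):
    # Stub extractor agent: realizes one feature as a binary value.
    if feature.name == "mentions_refund":
        return int("refund" in text.lower())
    return int("!" in text)

def evaluate(features, corpus, labels):
    # Dataset-level objective: accuracy of a simple feature-vote rule,
    # standing in for a downstream supervised classifier.
    correct = 0
    for text, y in zip(corpus, labels):
        values = [extract_value(f, text) for f in features]
        pred = int(sum(values) >= 1)
        correct += int(pred == y)
    return correct / len(corpus)

def optimize(corpus, labels, rounds=3):
    # Iteratively propose a global feature set, score it on the whole
    # dataset, and feed the score back to the proposer.
    feedback, best_score, best_feats = None, -1.0, []
    for _ in range(rounds):
        feats = propose_features(corpus, feedback)
        score = evaluate(feats, corpus, labels)
        if score > best_score:
            best_score, best_feats = score, feats
        feedback = f"accuracy={score:.2f}"  # structured dataset-level feedback
    return best_score, best_feats

corpus = ["I want a refund!", "Great product, works as described."]
labels = [1, 0]
score, feats = optimize(corpus, labels)
```

Note that the feedback signal is computed over the entire labelled corpus rather than per example, which is the key structural difference from per-sample prompt optimization.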