Feature transformation derives a new set of features from the original dataset to enhance the data's utility. In domains such as material performance screening, the data are high-dimensional and labels are expensive and slow to collect, so feature spaces must be transformed efficiently and without supervision to improve data readiness and AI utility. However, existing methods struggle to navigate the vast space of feature combinations efficiently and are mostly designed for supervised settings. To fill this gap, we propose a generator-critic duet-play teaming framework that uses LLM agents and in-context learning to derive pseudo-supervision from unsupervised data. The framework consists of three interconnected steps: (1) a critic agent diagnoses the data and generates actionable advice; (2) a generator agent produces tokenized feature transformations guided by the critic's advice; and (3) iterative refinement, driven by feedback between the two agents, continually improves the transformations. The framework generalizes to human-agent collaborative generation by replacing the critic agent with human experts. Extensive experiments demonstrate that the proposed framework outperforms even supervised baselines in feature transformation efficiency, robustness, and practical applicability across diverse datasets.
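The three steps can be illustrated with a toy loop. This is only a minimal sketch under strong assumptions, not the authors' implementation: both agents are LLMs in the proposed framework, whereas here they are replaced by hypothetical rule-based stand-ins (`critic`, `generator`) so that the control flow is runnable. The skewness diagnostic, the `log1p` operator, the postfix token format, and all function names are illustrative choices that do not come from the abstract.

```python
import math
import statistics

# Hypothetical stand-ins: in the proposed framework both agents are LLMs
# driven by in-context learning. Simple rules are used here so the loop runs.

# Unary operators a token may name. abs() keeps inputs inside the domain.
UNARY_OPS = {
    "log1p": lambda v: math.log1p(abs(v)),
    "sqrt": lambda v: math.sqrt(abs(v)),
}


def skewness(values):
    """Population skewness of a column; 0.0 for a constant column."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return 0.0
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


def critic(columns, skew_threshold=1.0):
    """Step 1: diagnose unlabeled columns and return actionable advice."""
    advice = []
    for key, values in columns.items():
        s = skewness(values)
        if abs(s) > skew_threshold:
            advice.append({"feature": key, "issue": "skewed", "skew": s})
    return advice


def generator(advice):
    """Step 2: turn advice into tokenized transformations (postfix tuples)."""
    return [item["feature"] + ("log1p",) for item in advice]


def apply_tokens(tokens, columns):
    """Evaluate a token tuple: one base feature followed by unary operators."""
    values = list(columns[tokens[:1]])
    for op in tokens[1:]:
        values = [UNARY_OPS[op](v) for v in values]
    return values


def refine(columns, max_rounds=3):
    """Step 3: alternate critic and generator until no advice remains."""
    columns = dict(columns)
    pending = list(columns)  # features the critic has not diagnosed yet
    history = []
    for _ in range(max_rounds):
        advice = critic({key: columns[key] for key in pending})
        if not advice:
            break
        pending = []
        for tokens in generator(advice):
            columns[tokens] = apply_tokens(tokens, columns)
            history.append(tokens)
            pending.append(tokens)  # new features are diagnosed next round
    return columns, history


if __name__ == "__main__":
    data = {
        ("x",): [1.0, 2.0, 3.0, 4.0, 1000.0],  # heavily right-skewed
        ("y",): [1.0, 2.0, 3.0, 4.0, 5.0],     # symmetric
    }
    _, applied = refine(data, max_rounds=1)
    print(applied)
```

In this sketch the critic's advice plays the role of pseudo-supervision: no labels are consulted, and only the flagged features are passed to the generator. Replacing `critic` with a function that collects advice from a person would give the human-agent variant mentioned above.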