Automated prompt optimization methods (e.g., DSPy, TextGrad) can substantially improve the performance of large language models (LLMs), but their gains generalize poorly across tasks. In practice, an optimized prompt that excels on one benchmark often fails to transfer to another, and this limitation persists even when the LLM backbone is changed. To investigate the underexplored sources of this heterogeneity in prompt performance, we conduct a causal-inference-inspired observational analysis of optimized prompts across a diverse set of optimization frameworks, LLM backbones, and NLP benchmarks. Specifically, we combine propensity-adjusted associational analysis with multiple complementary representations of prompt edits to identify consistent task-conditioned edit patterns. We find that complexity-increasing and meta-instructional edits are negatively associated with mathematical and multi-hop reasoning performance, whereas step-by-step and meta-cognitive edits are associated with improved performance on logical and sequential reasoning tasks. These patterns are robust across cognitive-load annotations, surface-level text features, and edit-motif analyses, and they generalize across optimization frameworks. Overall, these results indicate that prompt optimization failures arise from systematic interactions between edit families and task characteristics rather than from random optimization artifacts, providing a feature-level characterization of optimizer behavior and motivating future task-conditioned optimizer design.