Autonomous agentic workflows that iteratively refine their own behavior hold considerable promise, yet their failure modes remain poorly characterized. We investigate optimization instability, a phenomenon in which continued autonomous improvement paradoxically degrades classifier performance, using Pythia, an open-source framework for automated prompt optimization. Evaluating three clinical symptoms with varying prevalence (shortness of breath at 23%, chest pain at 12%, and Long COVID brain fog at 3%), we observed that validation sensitivity oscillated between 1.0 and 0.0 across iterations, with severity inversely proportional to class prevalence. At 3% prevalence, the system achieved 95% accuracy while detecting zero positive cases, a failure mode obscured by standard evaluation metrics. We evaluated two interventions: a guiding agent that actively redirected optimization amplified overfitting rather than correcting it, whereas a selector agent that retrospectively identified the best-performing iteration successfully prevented catastrophic failure. With selector-agent oversight, the system outperformed expert-curated lexicons by 331% (F1) on brain fog detection and by 7% on chest pain detection, despite requiring only a single natural language term as input. These findings characterize a critical failure mode of autonomous AI systems and demonstrate that retrospective selection outperforms active intervention for stabilizing low-prevalence classification tasks.
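The two quantitative claims above can be made concrete with a minimal sketch. This is not the Pythia implementation; the confusion-matrix counts are hypothetical numbers chosen to illustrate (a) why accuracy masks a collapsed classifier at low prevalence and (b) how a retrospective selector picks the best iteration by validation F1 rather than trusting the final one.

```python
def metrics(tp, fp, tn, fn):
    """Return (accuracy, sensitivity, F1) from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * prec * sens / (prec + sens) if (prec + sens) else 0.0
    return acc, sens, f1

# 1,000 validation notes at 3% prevalence: 30 positives, 970 negatives.
# A collapsed classifier that labels everything negative still scores
# 97% accuracy while sensitivity and F1 are both zero.
acc, sens, f1 = metrics(tp=0, fp=0, tn=970, fn=30)
print(f"accuracy={acc:.2f} sensitivity={sens:.2f} f1={f1:.2f}")

# Retrospective selection: keep every iteration's validation counts and
# choose the iteration with the best F1 (all counts are illustrative).
iterations = [
    {"iter": 1, "tp": 24, "fp": 60,  "tn": 910, "fn": 6},   # balanced
    {"iter": 2, "tp": 30, "fp": 400, "tn": 570, "fn": 0},   # sens=1.0, poor precision
    {"iter": 3, "tp": 0,  "fp": 0,   "tn": 970, "fn": 30},  # collapsed, sens=0.0
]
best = max(
    iterations,
    key=lambda it: metrics(it["tp"], it["fp"], it["tn"], it["fn"])[2],
)
print("selected iteration:", best["iter"])
```

The oscillation described in the abstract corresponds to iterations 2 and 3 above: sensitivity swings between 1.0 and 0.0, and a selector that scores every iteration retrospectively avoids shipping either extreme.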