This work studies the applicability of automatic AI agent optimization methods to real-world agents in formal verification settings, focusing on automated theorem proving in Rocq as a representative and challenging domain. We evaluate how different automatic agent optimizers perform when applied to the task of optimizing a Rocq proof-generation agent, and assess whether parts of the fine-grained tuning of agentic systems, such as prompt design, contextual knowledge, and control strategies, can be automated. Our results show that while several optimizers yield measurable improvements, simple few-shot bootstrapping is the most consistently effective; however, none of the studied methods matches the performance of a carefully engineered state-of-the-art proof agent.
翻译:本研究探讨了自动AI智能体优化方法在形式化验证场景中实际智能体的适用性,聚焦于以Rocq自动定理证明为代表的挑战性领域。我们评估了不同自动智能体优化器在优化Rocq证明生成智能体任务中的表现,并检验智能体系统中提示设计、上下文知识与控制策略等细粒度调优环节能否实现自动化。实验结果表明:尽管多种优化器能带来可测量的性能提升,但简单的少样本引导法具有最稳定的有效性;然而,所有被研究方法的性能均未能超越精心设计的最先进证明智能体。