Greedy Is a Strong Default: Agents as Iterative Optimizers

Classical optimization algorithms (hill climbing, simulated annealing, population-based methods) generate candidate solutions via random perturbations. We replace the random proposal generator with an LLM agent that reasons about evaluation diagnostics to propose informed candidates, and ask: does the classical optimization machinery still help when the proposer is no longer random? We evaluate on four tasks spanning discrete, mixed, and continuous search spaces, each replicated across 3 independent runs: rule-based classification on Breast Cancer (test accuracy 86.0% to 96.5%), mixed hyperparameter optimization for MobileNetV3-Small on STL-10 (84.5% to 85.8%, with zero catastrophic failures vs. 60% for random search), LoRA fine-tuning of Qwen2.5-0.5B on SST-2 (89.5% to 92.7%, matching Optuna TPE with 2x efficiency), and XGBoost on Adult Census (AUC 0.9297 to 0.9317, tying CMA-ES with 3x fewer evaluations). On these tasks, a cross-task ablation shows that simulated annealing, parallel investigators, and even a second LLM (OpenAI Codex) provide no benefit over greedy hill climbing while requiring 2-3x more evaluations. In our setting, the LLM's learned prior appears strong enough that acceptance-rule sophistication has limited impact: round 1 alone delivers the majority of the improvement, and the variants converge to similar configurations across strategies. The practical implication is surprisingly simple: greedy hill climbing with early stopping is a strong default. Beyond accuracy, the framework produces human-interpretable artifacts: the discovered cancer classification rules independently recapitulate established cytopathology principles.
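To make the recommended default concrete, here is a minimal sketch of greedy hill climbing with early stopping. It is an illustration under stated assumptions, not the paper's implementation. The names `greedy_hill_climb`, `propose`, `evaluate`, `max_rounds`, and `patience` are hypothetical. The `propose` callable stands in for the LLM agent, and a fixed-step toy proposer is used here so the example runs without any model. The abstract does not specify the stopping rule, so the patience-based rule below is an assumption.

```python
from typing import Callable, Dict, Tuple

Config = Dict[str, float]


def greedy_hill_climb(
    initial: Config,
    propose: Callable[[Config, float], Config],  # stand-in for the LLM agent (assumption)
    evaluate: Callable[[Config], float],         # higher score is better
    max_rounds: int = 10,
    patience: int = 2,                           # assumed early-stopping rule
) -> Tuple[Config, float, int]:
    """Accept a candidate only if it strictly improves on the incumbent.

    Stops after `patience` consecutive rounds without improvement.
    Returns (best config, best score, number of evaluations used).
    """
    best = dict(initial)
    best_score = evaluate(best)
    evals = 1
    stale = 0
    for _ in range(max_rounds):
        candidate = propose(best, best_score)
        score = evaluate(candidate)
        evals += 1
        if score > best_score:      # greedy acceptance: no annealing, no population
            best, best_score = candidate, score
            stale = 0
        else:
            stale += 1
            if stale >= patience:   # early stopping
                break
    return best, best_score, evals


# Toy objective with its maximum at x = 3 (illustrative only).
def toy_evaluate(cfg: Config) -> float:
    return -(cfg["x"] - 3.0) ** 2


# Toy proposer: always steps x up by 1. A real agent would instead
# read evaluation diagnostics and reason about the next candidate.
def toy_propose(cfg: Config, score: float) -> Config:
    return {"x": cfg["x"] + 1.0}


best, best_score, evals = greedy_hill_climb({"x": 0.0}, toy_propose, toy_evaluate)
print(best, best_score, evals)
```

In this toy run the search climbs from x = 0 to x = 3, rejects the overshoot to x = 4 twice, and stops early. Simulated annealing would differ only in the acceptance line, where it would sometimes accept a worse candidate. That acceptance rule is the component the abstract reports as having limited impact in this setting.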
