Classical optimization algorithms (hill climbing, simulated annealing, population-based methods) generate candidate solutions via random perturbations. We replace the random proposal generator with an LLM agent that reasons about evaluation diagnostics to propose informed candidates, and ask: does the classical optimization machinery still help when the proposer is no longer random? We evaluate on four tasks spanning discrete, mixed, and continuous search spaces, each replicated across 3 independent runs: rule-based classification on Breast Cancer (test accuracy 86.0% to 96.5%), mixed hyperparameter optimization for MobileNetV3-Small on STL-10 (84.5% to 85.8%, with zero catastrophic failures vs. 60% for random search), LoRA fine-tuning of Qwen2.5-0.5B on SST-2 (89.5% to 92.7%, matching Optuna TPE with 2x efficiency), and XGBoost on Adult Census (AUC 0.9297 to 0.9317, tying CMA-ES with 3x fewer evaluations). On these tasks, a cross-task ablation shows that simulated annealing, parallel investigators, and even a second LLM (OpenAI Codex) provide no benefit over greedy hill climbing while requiring 2-3x more evaluations. In our setting, the LLM's learned prior appears strong enough that acceptance-rule sophistication has limited impact: round 1 alone delivers the majority of the improvement, and variants converge to similar configurations across strategies. The practical implication is surprising simplicity: greedy hill climbing with early stopping is a strong default. Beyond accuracy, the framework produces human-interpretable artifacts: the discovered cancer classification rules independently recapitulate established cytopathology principles.
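As an illustration of the default recommended in the abstract, the following is a minimal sketch of greedy hill climbing with early stopping. It is not the authors' implementation. The names `greedy_hill_climb`, `propose`, `evaluate`, `max_rounds`, and `patience` are hypothetical, and the diagnostics passed to the proposer are reduced here to the incumbent's scalar score. In the paper's setting `propose` would wrap an LLM agent; below, a deterministic toy function stands in for it so that the example runs on its own.

```python
from typing import Callable, Tuple, TypeVar

C = TypeVar("C")  # type of a candidate solution (rule set, config, ...)


def greedy_hill_climb(
    initial: C,
    propose: Callable[[C, float], C],   # hypothetical: (incumbent, its score) -> new candidate
    evaluate: Callable[[C], float],     # higher is better
    max_rounds: int = 20,               # assumed budget, not taken from the paper
    patience: int = 3,                  # assumed early-stopping patience
) -> Tuple[C, float, int]:
    """Return (best candidate, best score, number of evaluations used)."""
    best, best_score = initial, evaluate(initial)
    evaluations = 1
    stale = 0  # consecutive rounds without improvement
    for _ in range(max_rounds):
        candidate = propose(best, best_score)
        score = evaluate(candidate)
        evaluations += 1
        if score > best_score:
            # Greedy acceptance: keep a candidate only if it strictly improves.
            best, best_score = candidate, score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break  # early stopping
    return best, best_score, evaluations


# Toy stand-ins (assumptions for illustration only): the "proposer" always
# steps right by one, and the objective peaks at x = 5.
def toy_propose(x: int, score: float) -> int:
    return x + 1


def toy_evaluate(x: int) -> float:
    return -(x - 5) ** 2


best, best_score, n_evals = greedy_hill_climb(0, toy_propose, toy_evaluate)
print(best, best_score, n_evals)
```

Swapping the strict-improvement test for a temperature-dependent acceptance rule would turn this loop into simulated annealing, which is the kind of added machinery the ablation compares against the greedy baseline.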