We present ARES-LSHADE, a memetic differential-evolution variant submitted to the GECCO 2026 competition on LLM-designed evolutionary algorithms for the Generalized Numerical Benchmark Generator (GNBG). The algorithm builds on the LLM-LSHADE 2025 winner and contributes two new components: (a) a scout-augmented mutation operator with adaptive CMA-ES integration, produced by an autonomous research loop across approximately thirty LLM-driven design experiments, and (b) a multi-start L-BFGS-B polish phase that respects strict blackbox treatment of the benchmark. On the official 31-run-per-function evaluation with the competition-specified function-evaluation budgets, ARES-LSHADE obtains 510 wins out of 744 runs (24 functions × 31 runs; per-function gap below 1e-8), reaching machine precision on 18 of 24 functions. The remaining six functions exhibit characteristic plateau signatures consistent with GNBG's compositional structure and were independently identified by the autoresearch loop as the hardest of the suite. Beyond the result itself, this report documents two methodological observations: (i) an LLM-driven research loop with an operator-only edit surface and a fitness-only observation space converges to a characteristic plateau on this benchmark; (ii) when we initially widened the observation space to include the benchmark's compositional metadata, the resulting algorithm trivially solved all 24 functions but violated the competition's blackbox rule, which we identified before submission. We discuss this tension between LLM capability and benchmark integrity as a design consideration for future LLM-driven optimization-algorithm research. Code and reproducibility artifacts are available at https://github.com/anaeem1/ARES-LSHADE.
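To make the win count concrete, the following is a minimal sketch of how such a tally could be computed. It assumes that a run counts as a win when its final gap to the known optimum is strictly below 1e-8; the function name `count_wins` and the toy gap values are hypothetical, chosen only for illustration, and are not taken from the competition harness or from our results.

```python
# Illustrative tally of "wins" over a functions-by-runs table of final gaps.
# Assumption: a win is a run whose final gap is strictly below the threshold.

N_FUNCTIONS = 24       # GNBG functions in the suite
N_RUNS = 31            # independent runs per function
WIN_THRESHOLD = 1e-8   # gap below which a run is counted as a win


def count_wins(gaps, threshold=WIN_THRESHOLD):
    """Return (wins per function, total wins).

    gaps[f][r] is the final absolute gap of run r on function f.
    """
    per_function = [sum(1 for g in runs if g < threshold) for runs in gaps]
    return per_function, sum(per_function)


# Toy data (3 functions x 4 runs), purely synthetic.
toy_gaps = [
    [0.0, 3e-12, 5e-9, 0.0],   # every run is below the threshold
    [1e-3, 2e-9, 0.4, 7e-9],   # two of four runs are below the threshold
    [2.5, 0.1, 1e-8, 3.0],     # 1e-8 is not strictly below the threshold
]

per_function, total = count_wins(toy_gaps)
print(per_function, total)
print("runs in the full evaluation:", N_FUNCTIONS * N_RUNS)
```

Under this reading, the denominator of the reported score is simply the number of functions multiplied by the number of runs per function.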