Optimization benchmarks play a fundamental role in assessing algorithm performance; however, existing artificial benchmarks often fail to capture the diversity and irregularity of real-world problem structures, while benchmarks derived from real-world problems are costly and difficult to construct. To address these challenges, we propose an evolutionary framework for automatic benchmark generation, termed the LLM-driven evolutionary benchmark generator (LLM-EBG), in which a large language model (LLM) acts as the generative operator: it creates and evolves benchmark problems within a flexible, expressive representation space. As a case study, we generate unconstrained single-objective continuous minimization problems, represented as mathematical expressions, that are designed to induce significant performance differences between a genetic algorithm (GA) and differential evolution (DE). Experimental results show that LLM-EBG produces benchmark problems on which the designated target algorithm outperforms the comparative algorithm in more than 80\% of trials. Furthermore, exploratory landscape analysis reveals that the benchmarks favoring GA are highly sensitive to variable scaling, demonstrating that the proposed framework can generate problems with distinct geometric characteristics that reflect the intrinsic search behaviors of different optimization algorithms.
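To make the loop concrete, the following is a minimal, runnable Python sketch of the evolutionary procedure the abstract describes. It is written under loudly stated assumptions: `run_solver`, `llm_propose`, the random-search stand-in for GA/DE, and the symbolic recombination stand-in for the LLM call are all illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch of an LLM-driven evolutionary benchmark generator.
# All names and stubs here are illustrative assumptions, not LLM-EBG's code.
import math
import random

def run_solver(solver: str, f, dim: int = 2, budget: int = 100) -> float:
    """Stand-in for running GA or DE on objective f: plain random search
    seeded per solver name, returning the best (lowest) value found.
    A real implementation would call actual GA/DE optimizers."""
    rng = random.Random(hash(solver) & 0xFFFF)
    return min(f([rng.uniform(-5, 5) for _ in range(dim)])
               for _ in range(budget))

def fitness(expr: str, target: str = "GA", rival: str = "DE") -> float:
    """Fitness of a candidate benchmark: the performance gap in favor of
    the target algorithm (both solvers minimize, so a larger
    rival-minus-target gap means the target does better)."""
    code = compile(expr, "<expr>", "eval")
    f = lambda x: eval(code, {"math": math, "x": x})
    return run_solver(rival, f) - run_solver(target, f)

def llm_propose(parents: list[str]) -> str:
    """Stand-in for the LLM-as-operator step: the framework prompts an LLM
    with parent expressions and parses a new expression from its reply;
    here we just recombine the parents symbolically."""
    a, b = random.sample(parents, 2)
    return f"({a}) + 0.5*({b})"

def llm_ebg(seed_exprs: list[str], generations: int = 6, pop_size: int = 8) -> str:
    population = list(seed_exprs)
    for _ in range(generations):
        population.append(llm_propose(population))  # LLM generates offspring
        population.sort(key=fitness, reverse=True)  # larger gap = fitter
        population = population[:pop_size]          # environmental selection
    return population[0]  # expression on which the target wins by the most

print(llm_ebg(["sum(xi * xi for xi in x)", "sum(abs(xi) for xi in x)"]))
```

The point the sketch emphasizes is that the fitness of a *benchmark* candidate is the performance gap between two optimizers run on it, so selection pressure drives the population toward problems on which the designated target algorithm wins.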
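As a separate hedged illustration of what "sensitive to variable scaling" means operationally, the toy probe below compares a solver's result on an objective f and on a diagonally rescaled variant f(Dx). The helper names and the random-search solver are hypothetical; this is not the paper's exploratory-landscape-analysis pipeline.

```python
# Hypothetical probe of scaling sensitivity: if a problem (or the solver
# attacking it) is scale-sensitive, performance degrades once the
# variables are stretched by a diagonal matrix D. Illustrative only.
import random

def random_search(f, dim, budget=500, seed=0):
    """Toy solver used only for this probe (not GA or DE)."""
    rng = random.Random(seed)
    return min(f([rng.uniform(-5, 5) for _ in range(dim)])
               for _ in range(budget))

def scaled(f, scales):
    """Apply a fixed diagonal rescaling D to the variables of f."""
    return lambda x: f([s * xi for s, xi in zip(scales, x)])

sphere = lambda x: sum(xi * xi for xi in x)   # simple scale-sensitive baseline
dim = 5
scales = [10.0 ** i for i in range(dim)]      # per-axis scales from 1 to 1e4
print(random_search(sphere, dim),
      random_search(scaled(sphere, scales), dim))
```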