Scientific algorithm discovery is iterative: hypotheses are proposed, implemented, stress-tested, and revised. Current LLM-guided search systems accelerate proposal generation, but they often under-represent scientific structure by optimizing code-only artifacts with weak correctness and originality gating. We present CliffSearch, an agentic evolutionary framework in which the core evolution operators (pair selection, crossover, mutation, and review) are implemented as LLM agents, and the loop is designed around three principles: (1) each node is a structured scientific artifact, instantiated in either theory+code or code_only mode; (2) reviewer judgments of correctness and originality are first-class selection gates alongside optimization of the benchmark metric of interest; and (3) mutation is split into exploration and correction pathways with distinct objectives. Exploration mutation imports ideas from adjacent scientific domains to increase novelty, while correction mutation performs targeted, evidence-guided repair using reviewer signals over theory, code, benchmark results, and runtime errors. We illustrate the framework on three benchmark-grounded studies: transformer hyper-connection evolution, optimizer discovery on a fixed nanoGPT stack, and a smaller native-optimizer ablation. Across these settings, the same loop supports explicit metric direction, reproducible persistence, and reviewer-gated comparison of discoveries under controlled search conditions. The result is a discovery workflow that prioritizes scientific interpretability and correctness while optimizing task metrics under controlled novelty constraints, rather than maximizing candidate throughput alone. Full run artifacts, interactive visualizations, and exported best nodes for the reported studies are available at https://cliffsearch.ai.
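The loop described above can be sketched in miniature. This is an illustrative assumption, not the CliffSearch implementation: all names (`Node`, `review`, `crossover`, `mutate_explore`, `mutate_correct`) are hypothetical, and the LLM agent calls are stubbed with simple deterministic functions so that only the control flow, the exploration/correction split, and the reviewer gate are shown.

```python
import random

class Node:
    """A structured artifact: theory text plus code, with a benchmark score."""
    def __init__(self, theory, code, score):
        self.theory, self.code, self.score = theory, code, score

def review(node):
    # Stub reviewer agent: gate on correctness and originality.
    # Here a toy heuristic stands in for an LLM judgment.
    correct = "bug" not in node.code
    original = len(set(node.theory.split())) > 2
    return correct and original

def crossover(a, b):
    # Stub crossover agent: combine the parents' theory and code.
    return Node(a.theory + " + " + b.theory, a.code, (a.score + b.score) / 2)

def mutate_explore(node, rng):
    # Exploration pathway: import an idea from an adjacent domain (stubbed).
    return Node(node.theory + " [cross-domain idea]", node.code,
                node.score + rng.uniform(-0.1, 0.2))

def mutate_correct(node):
    # Correction pathway: targeted repair guided by reviewer/runtime signals.
    return Node(node.theory, node.code.replace("bug", "fix"), node.score)

def evolve(population, generations, rng, keep=8):
    for _ in range(generations):
        a, b = rng.sample(population, 2)       # pair selection
        child = crossover(a, b)
        child = mutate_explore(child, rng)     # exploration mutation
        if not review(child):
            child = mutate_correct(child)      # correction mutation
        if review(child):                      # reviewer gate: first-class filter
            population.append(child)
        # Explicit metric direction: higher benchmark score is better here.
        population.sort(key=lambda n: n.score, reverse=True)
        del population[keep:]                  # retain only the best nodes
    return population[0]
```

In this toy version the reviewer gate is a boolean filter applied before a candidate can enter the population, which mirrors the paper's framing of correctness and originality as selection gates rather than soft penalties on the optimized metric.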