Large language models (LLMs) show promise in supporting scientific research implementation, yet their ability to generate correct and executable code remains limited. Existing work largely adopts one-shot settings, ignoring the iterative, feedback-driven nature of realistic scientific research development workflows. To address this gap, we present RECODE-H, a benchmark of 102 tasks from research papers and repositories that evaluates LLM agents through multi-turn interactions with LLM-simulated human feedback. It includes structured instructions, unit tests, and a five-level feedback hierarchy to reflect realistic researcher-agent collaboration. We further present ReCodeAgent, a framework that integrates feedback into iterative code generation. Experiments with leading LLMs, including GPT-5, Claude-Sonnet-4, DeepSeek-V3.1, and Gemini 2.5, show substantial performance gains with richer feedback, while also highlighting persistent challenges in generating complex research code. RECODE-H establishes a foundation for developing adaptive, feedback-driven LLM agents in scientific research implementation.
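To make the evaluated interaction pattern concrete, below is a minimal sketch of the generate-test-revise loop that such feedback-driven agents follow. It is an illustration, not the authors' implementation: the names (FeedbackLevel, llm_generate, simulate_feedback, run_unit_tests) are hypothetical placeholders, and the five feedback levels shown are assumed for illustration rather than taken from the paper's exact hierarchy.

```python
"""Minimal sketch of a multi-turn, feedback-driven code-generation loop in the
spirit of RECODE-H / ReCodeAgent. All names are hypothetical placeholders."""
from enum import IntEnum
from typing import Callable


class FeedbackLevel(IntEnum):
    """Assumed five-level hierarchy: higher levels carry richer guidance."""
    BINARY = 1      # pass/fail signal only
    ERROR_LOG = 2   # raw test output
    LOCALIZED = 3   # which function or line is wrong
    DIAGNOSIS = 4   # natural-language explanation of the bug
    CORRECTIVE = 5  # concrete fix suggestion


def run_unit_tests(code: str, tests: Callable[[dict], None]) -> tuple[bool, str]:
    """Execute candidate code in a fresh namespace and run the task's unit tests."""
    ns: dict = {}
    try:
        exec(code, ns)        # load the candidate implementation
        tests(ns)             # raises AssertionError on failure
        return True, "all tests passed"
    except Exception as exc:  # capture the failure signal for feedback
        return False, f"{type(exc).__name__}: {exc}"


def agent_loop(instruction: str,
               llm_generate: Callable[[str], str],
               tests: Callable[[dict], None],
               simulate_feedback: Callable[[str, str, FeedbackLevel], str],
               level: FeedbackLevel,
               max_turns: int = 5) -> bool:
    """Iterate generate -> test -> feedback until tests pass or turns run out."""
    prompt = instruction
    for _ in range(max_turns):
        code = llm_generate(prompt)
        ok, log = run_unit_tests(code, tests)
        if ok:
            return True
        feedback = simulate_feedback(code, log, level)
        prompt = (f"{instruction}\n\nPrevious attempt:\n{code}\n"
                  f"Feedback: {feedback}")
    return False


# Toy usage: a mock "LLM" whose first draft is buggy and whose revision is correct.
def mock_llm(prompt: str) -> str:
    if "Feedback" in prompt:
        return "def square(x):\n    return x * x\n"
    return "def square(x):\n    return x + x\n"  # buggy first draft

def tests(ns: dict) -> None:
    assert ns["square"](3) == 9

def mock_feedback(code: str, log: str, level: FeedbackLevel) -> str:
    if level <= FeedbackLevel.ERROR_LOG:
        return log
    return "square should multiply, not add"

print(agent_loop("Implement square(x).", mock_llm, tests, mock_feedback,
                 FeedbackLevel.DIAGNOSIS))  # -> True after one revision
```

In this framing, the feedback level is the axis the benchmark varies: a real simulated researcher would return anything from a bare pass/fail bit up to a corrective suggestion, and the abstract's reported gains come from moving toward the richer end of that scale.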