Reinforcement learning pipelines for large language model (LLM) training often rely on environments that are manually redesigned between stages, requiring practitioners to heuristically infer which configuration will best improve the current policy. To automate this process, we propose the LLM-as-Environment-Engineer framework, in which the current policy model analyzes failure trajectories together with contextual information and proposes modifications to the next-stage training environment configuration. We also introduce MAPF-FrozenLake, a controllable testbed whose generator exposes multi-dimensional environment configurations, making it suitable for studying and benchmarking environment redesign. On this testbed, we condition the environment engineer on structured summaries of policy behavior, failure cases, and environment statistics, from which it produces the configuration for the next training stage. With Qwen3-4B as the backbone, our framework achieves the strongest aggregate performance on our benchmarks, outperforming larger proprietary LLMs (e.g., GPT, Gemini) and fixed-environment training baselines. We further analyze which forms of context are most effective, finding that successful environment updates rely on failure evidence and preserve configurations that already work. Interestingly, the current RL checkpoint serves as a better environment engineer than the original base model, suggesting that policy learning improves the model's ability to diagnose its remaining weaknesses.
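As a rough illustration of the redesign loop described in the abstract, the following minimal sketch shows one stage transition: a structured summary is packed into a prompt, the environment engineer returns a proposed change, and the change is merged into the existing configuration so that untouched settings are preserved. Everything here is an assumption made for illustration: the `EnvConfig` fields, the prompt wording, the JSON patch format, and the `stub_engineer` function that stands in for the policy model are hypothetical and are not taken from the paper.

```python
import json
from dataclasses import dataclass, asdict, replace
from typing import Callable, List


@dataclass(frozen=True)
class EnvConfig:
    # Hypothetical knobs; the actual MAPF-FrozenLake dimensions are not given in the abstract.
    grid_size: int = 4
    num_agents: int = 2
    hole_density: float = 0.1


def build_prompt(config: EnvConfig, success_rate: float, failures: List[str]) -> str:
    """Pack the current config, a policy statistic, and failure cases into one prompt."""
    context = {
        "config": asdict(config),
        "success_rate": success_rate,
        "failure_cases": failures,
    }
    return (
        "Propose changes to the next-stage environment config as a JSON object. "
        "Change only what the failures justify.\n" + json.dumps(context, indent=2)
    )


def apply_proposal(config: EnvConfig, proposal_json: str) -> EnvConfig:
    """Merge a proposed JSON patch into the config, ignoring unknown keys."""
    proposal = json.loads(proposal_json)
    known = {k: v for k, v in proposal.items() if k in asdict(config)}
    return replace(config, **known)


def redesign_step(
    config: EnvConfig,
    success_rate: float,
    failures: List[str],
    engineer: Callable[[str], str],
) -> EnvConfig:
    """One stage transition: summarize, ask the engineer, apply its patch."""
    return apply_proposal(config, engineer(build_prompt(config, success_rate, failures)))


def stub_engineer(prompt: str) -> str:
    # Stand-in for the policy model acting as environment engineer (assumption).
    return json.dumps({"hole_density": 0.05, "unknown_knob": 1})


next_config = redesign_step(
    EnvConfig(), 0.4, ["agent 0 fell into a hole at (1, 2)"], stub_engineer
)
print(next_config)  # EnvConfig(grid_size=4, num_agents=2, hole_density=0.05)
```

Applying the proposal as a patch rather than a full replacement is one simple way to reflect the finding that successful updates preserve configurations that already work; a real system would also need to validate value types and ranges before training on the new configuration.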