Modern coding scaffolds turn LLMs into capable software agents, but their ability to follow scaffold-specified instructions remains under-examined, especially when constraints are heterogeneous and persist across interactions. To fill this gap, we introduce OctoBench, a benchmark for scaffold-aware instruction following in repository-grounded agentic coding. OctoBench comprises 34 environments and 217 tasks instantiated under three scaffold types, paired with 7,098 objective checklist items. To disentangle solving the task from following the rules, we provide an automated observation-and-scoring toolkit that captures full trajectories and performs fine-grained checks. Experiments on eight representative models reveal a systematic gap between task-solving ability and scaffold-aware compliance, underscoring the need for training and evaluation that explicitly target heterogeneous instruction following. We release the benchmark to support reproducible evaluation and to accelerate the development of more scaffold-aware coding agents.