Modern coding scaffolds turn LLMs into capable software agents, but their ability to follow scaffold-specified instructions remains under-examined, especially when constraints are heterogeneous and persist across interactions. To fill this gap, we introduce OctoBench, a benchmark for scaffold-aware instruction following in repository-grounded agentic coding. OctoBench comprises 34 environments and 217 tasks instantiated under three scaffold types, paired with 7,098 objective checklist items. To disentangle solving the task from following the rules, we provide an automated observation-and-scoring toolkit that captures full trajectories and performs fine-grained checks. Experiments on eight representative models reveal a systematic gap between task-solving ability and scaffold-aware compliance, underscoring the need for training and evaluation that explicitly target heterogeneous instruction following. We release the benchmark to support reproducible evaluation and to accelerate the development of more scaffold-aware coding agents.