Constraint Decay: The Fragility of LLM Agents in Backend Code Generation

Large Language Model (LLM) agents demonstrate strong performance in autonomous code generation under loose specifications. However, production-grade software requires strict adherence to structural constraints, such as architectural patterns, databases, and object-relational mappings. Existing benchmarks often overlook these non-functional requirements, rewarding functionally correct but structurally arbitrary solutions. We present a systematic study evaluating how well agents handle structural constraints in multi-file backend generation. By fixing a unified API contract across 80 greenfield generation tasks and 20 feature-implementation tasks spanning eight web frameworks, we isolate the effect of structural complexity using a dual evaluation with end-to-end behavioral tests and static verifiers. Our findings reveal a phenomenon we call constraint decay: as structural requirements accumulate, agent performance declines substantially. Capable configurations lose 30 percentage points on average in assertion pass rate from baseline to fully specified tasks, while some weaker configurations approach zero. Framework sensitivity analysis exposes significant performance disparities: agents succeed in minimal, explicit frameworks (e.g., Flask) but perform substantially worse on average in convention-heavy environments (e.g., FastAPI, Django). Finally, error analysis identifies data-layer defects (e.g., incorrect query composition and ORM runtime violations) as the leading root causes. This work highlights that jointly satisfying functional and structural requirements remains a key open challenge for coding agents.
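
To make the dual evaluation and the decay measure concrete, the following is a minimal sketch rather than the study's actual harness. It assumes a static verifier can be approximated by checking whether generated source imports a required library, and that constraint decay is the drop in assertion pass rate between a baseline task and a fully specified one. The function names (imports_module, pass_rate, constraint_decay) and all numbers are hypothetical and chosen only for illustration.

```python
import ast


def imports_module(source: str, required: str) -> bool:
    """Hypothetical static verifier: True if `source` imports `required`
    or one of its submodules (e.g. required="sqlalchemy")."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports such as `from . import x` have module=None.
            names = [node.module or ""]
        else:
            continue
        if any(n == required or n.startswith(required + ".") for n in names):
            return True
    return False


def pass_rate(passed: int, total: int) -> float:
    """Assertion pass rate as a percentage (0.0 when there are no assertions)."""
    return 100.0 * passed / total if total else 0.0


def constraint_decay(baseline_rate: float, constrained_rate: float) -> float:
    """Drop in pass rate, in percentage points, from baseline to constrained."""
    return baseline_rate - constrained_rate


if __name__ == "__main__":
    # Made-up generated snippet and made-up assertion counts, for illustration only.
    generated = "from sqlalchemy.orm import Session\n"
    print(imports_module(generated, "sqlalchemy"))
    print(constraint_decay(pass_rate(18, 20), pass_rate(12, 20)))
```

In this sketch the behavioral side is reduced to counting passed assertions, while the structural side is a single import check; a real verifier would presumably inspect project layout, ORM usage, and database configuration as well.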
