Empirical Study for Structured Output Control in LLMs for Software Engineering

LLM-generated outputs in software engineering rarely exist in isolation. They must plug into toolchains, APIs, and data pipelines that impose strict, often organization-specific structural contracts. A semantically correct output that violates the expected format is, from the consuming system's perspective, indistinguishable from a wrong answer, which makes structural fidelity an operational prerequisite for deploying LLMs in practice. Yet current models routinely produce syntactically invalid or structurally non-compliant outputs. Unlike encoders, autoregressive decoders generate text token by token with a local rather than global focus, which amplifies structural fragility whenever the target format deviates from familiar training distributions. We present a systematic evaluation of structural reliability across four representative SE tasks, categorizing failures into syntax, structural, and semantic errors. To isolate the sources of these failures, we benchmark three mitigation strategies that target the decoder: grammar-constrained decoding, regex-based validation, and a strict template-driven control, Template Token Match Generation (TTMG). TTMG nearly eliminates syntax errors, yet substantial structural and semantic errors persist, demonstrating that the core bottleneck lies beyond syntactic formatting. A detailed case study further illustrates how residual errors cascade through downstream workflows. Our findings show that current structure-enforcing tools are necessary but insufficient, and highlight the need for approaches that jointly ensure structural fidelity and semantic correctness in LLM-driven workflows.
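To make the three failure categories concrete, the sketch below classifies a single model output against a small JSON contract. It is an illustration only, not the evaluation harness used in the study: the contract fields (`file`, `line`, `severity`), the allowed severity values, and the function name `classify_output` are all hypothetical, and the "semantic" check is a simple value-range test standing in for whatever task-specific correctness criteria a real pipeline would apply.

```python
import json

# Hypothetical contract: required keys and their expected Python types.
CONTRACT = {"file": str, "line": int, "severity": str}
# Hypothetical set of values the consuming system accepts.
ALLOWED_SEVERITIES = {"low", "medium", "high"}


def classify_output(raw: str) -> str:
    """Return 'syntax', 'structural', 'semantic', or 'ok' for one output."""
    # Syntax error: the text is not parseable JSON at all.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "syntax"

    # Structural error: valid JSON, but the wrong shape, missing keys,
    # or values of the wrong type.
    if not isinstance(data, dict):
        return "structural"
    for key, expected_type in CONTRACT.items():
        if key not in data or not isinstance(data[key], expected_type):
            return "structural"

    # Semantic error (stand-in check): well-formed and well-shaped,
    # but the values are not meaningful to the consumer.
    if data["severity"] not in ALLOWED_SEVERITIES or data["line"] < 1:
        return "semantic"

    return "ok"


if __name__ == "__main__":
    samples = [
        '{"file": "a.py", "line": 3, "severity": "high"',      # unclosed brace
        '{"file": "a.py", "line": "3", "severity": "high"}',   # line is a string
        '{"file": "a.py", "line": 3, "severity": "urgent"}',   # unknown severity
        '{"file": "a.py", "line": 3, "severity": "high"}',     # compliant
    ]
    for sample in samples:
        print(classify_output(sample), "<-", sample)
```

The checks run in order, so an output is only tested for structural compliance once it parses, and only tested for semantic validity once its shape is correct. This mirrors the observation above that removing syntax errors alone can still leave structural and semantic failures in place.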
