Are LLMs Ready for TOON? Benchmarking Structural Correctness-Sustainability Trade-offs in Novel Structured Output Formats

Large Language Models (LLMs) are increasingly required to generate structured, machine-readable outputs for downstream systems. While recent benchmarks have focused on evaluating the structural correctness of such outputs, the environmental impact of inference for different output formats has largely been overlooked. In this paper, we argue that structured output formats should be assessed not only in terms of correctness, but also with respect to their environmental efficiency. To this end, we introduce a sustainability-aware evaluation framework for structured generation that measures token usage, generation time, and estimated carbon emissions. Within this framework, we propose the Environment-Aware Generation Correctness Score (GCS_env), a unified metric that integrates structural correctness with carbon-aware efficiency. Using this framework, we systematically benchmark the novel TOON format against established representations (JSON, XML, YAML) across multiple LLMs spanning different architectures and parameter scales. Our results reveal a consistent trade-off: TOON yields markedly more compact outputs and lower emissions, but lower structural correctness when models lack native support. We show that increased model capacity reduces this gap and that environment-aware scoring can shift format rankings depending on deployment priorities. This work highlights the need for sustainability-inclusive benchmarking and provides empirical evidence that compact representations such as TOON can offer practical advantages in large-scale, carbon-conscious LLM deployments.
