As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce StructEval, a comprehensive benchmark for evaluating LLMs' capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: 1) generation tasks, producing structured output from natural language prompts, and 2) conversion tasks, translating between structured formats. Our benchmark encompasses 18 formats and 44 task types, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps: even state-of-the-art models like o1-mini achieve an average score of only 75.58, with open-source alternatives lagging approximately 10 points behind. We find generation tasks more challenging than conversion tasks, and producing correct visual content more difficult than generating text-only structures.