We present LLMStructBench, a novel benchmark for evaluating Large Language Models (LLMs) on extracting structured data and generating valid JavaScript Object Notation (JSON) outputs from natural-language text. Our open dataset comprises diverse, manually verified parsing scenarios of varying complexity and enables systematic testing across 22 models and five prompting strategies. We further introduce complementary performance metrics that capture both token-level accuracy and document-level validity, facilitating rigorous comparison of the effects of model, size, and prompting on parsing reliability. In particular, we show that choosing the right prompting strategy is more important than standard attributes such as model size: it ensures structural validity even for smaller or less reliable models, but increases the number of semantic errors. Our benchmark suite is a step toward future research on LLMs applied to parsing and Extract, Transform, Load (ETL) applications.