Most realistic task automation problems require large language models (LLMs) to call tools, which often return complex JSON responses. These responses must be further processed to derive the information necessary for task completion, yet the ability of LLMs to do so is under-studied. In this paper, we study the tool response processing task and LLMs' abilities to process structured (JSON) responses. We created a dataset for this task and evaluated 15 open- and closed-weight models using multiple prompting approaches. Our results show that JSON processing remains a difficult task even for frontier models across multiple prompting strategies. The optimal response processing strategy depends on both the nature and size of the tool outputs and the complexity of the required reasoning. Variations in processing approach can lead to performance differences ranging from 3\% to 50\%.