Despite their power, Large Language Models (LLMs) such as GPT-4 still struggle with tasks that require generating complex, structured outputs. In this study, we assess the capability of current LLMs to generate complex structured data and propose a structure-aware fine-tuning approach to improve this ability. To perform a comprehensive evaluation, we propose Struc-Bench, which includes five representative LLMs (including GPT-NeoX 20B, GPT-3.5, GPT-4, and Vicuna), and evaluate them on our carefully constructed datasets spanning raw text, HTML, and LaTeX tables. Based on our analysis of current model performance, we identify specific common formatting errors and areas of potential improvement. To address complex formatting requirements, we use FormatCoT (Chain-of-Thought) to generate format instructions from target outputs. Our experiments show that our structure-aware fine-tuning method, when applied to LLaMA-7B, significantly improves adherence to natural language constraints, outperforming the other evaluated LLMs. Based on these results, we present an ability map of model capabilities across six dimensions (i.e., coverage, formatting, reasoning, comprehension, pragmatics, and hallucination). This map highlights the weaknesses of LLMs in handling complex structured outputs and suggests promising directions for future work. Our code and models can be found at https://github.com/gersteinlab/Struc-Bench.