Given the substantial volumes of structured data held by many companies, enabling Large Language Models (LLMs) to directly understand structured text in non-structured forms could significantly enhance their capabilities across various business scenarios. To this end, we propose an evaluation data generation method for assessing LLMs' ability to understand structure-rich text; the method generates structured data of controllable complexity from manually crafted question templates and generation rules. Building on this generation method, we introduce StrucText-Eval, a benchmark comprising 6,032 questions across 8 different structured languages and 29 specific tasks. Furthermore, considering human proficiency in rule-based tasks, we also present StrucText-Eval-Hard, which includes 3,016 questions designed to further examine the gap between LLM and human performance. Results indicate that the best-performing LLM currently achieves an accuracy of 65.0\% on StrucText-Eval-Hard, while human accuracy reaches up to 95.7\%. Moreover, while fine-tuning on StrucText-Eval can enhance existing LLMs' understanding of all structured languages, it does not necessarily improve performance across all task types. The benchmark and generation code are open-sourced at https://github.com/MikeGu721/StrucText-Eval
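To make the idea of template- and rule-based generation with controllable complexity concrete, the following is a minimal sketch, not the StrucText-Eval implementation. It assumes JSON as the structured language and a single hypothetical path-lookup task; the function names (`make_tree`, `make_question`, `generate_sample`), the `depth` and `width` parameters, and the question wording are all illustrative assumptions rather than anything taken from the released code.

```python
import json
import random


def make_tree(depth, width, rng, prefix="k"):
    """Build a nested dict; `depth` and `width` are the (assumed) complexity knobs."""
    if depth == 0:
        return rng.randint(0, 999)  # leaf value
    return {
        f"{prefix}{i}": make_tree(depth - 1, width, rng, prefix=f"{prefix}{i}_")
        for i in range(width)
    }


def make_question(tree, rng):
    """Fill a hypothetical path-lookup template and derive the answer by rule."""
    path, node = [], tree
    while isinstance(node, dict):
        key = rng.choice(sorted(node))  # pick one child at each level
        path.append(key)
        node = node[key]
    question = f"What is the value stored at path {' -> '.join(path)}?"
    return question, path, node


def generate_sample(depth, width, seed):
    """Produce one (context, question, answer) item, reproducible from the seed."""
    rng = random.Random(seed)
    tree = make_tree(depth, width, rng)
    question, path, answer = make_question(tree, rng)
    return {
        "context": json.dumps(tree, indent=2),
        "question": question,
        "path": path,
        "answer": answer,
    }


sample = generate_sample(depth=3, width=2, seed=0)
print(sample["context"])
print(sample["question"])
```

Because the answer is computed from the generated structure rather than annotated by hand, raising `depth` or `width` in this sketch yields harder items without additional labeling effort.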