With the emergence of large language models (LLMs), there is an expectation that LLMs can effectively extract explicit information from complex real-world documents (e.g., papers, reports). However, most LLMs generate paragraph-style answers that are chaotic, disorganized, and untraceable. To bridge this gap, we introduce the Arranged and Organized Extraction Benchmark (AOE), a new bilingual benchmark with data and documents of varying lengths, designed to systematically evaluate the ability of LLMs to comprehend fragmented documents and reconstruct isolated information into a single organized table. Unlike conventional text-to-table tasks, which rely on fixed schemas and narrow task domains, AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schemas tailored to varied input queries. In our experiments, we evaluated both open-source and closed-source state-of-the-art LLMs. The results show that even the most advanced models struggle significantly with this task. The benchmark is available at https://huggingface.co/datasets/tianyumyum/AOE.