In this paper, we explore whether large language models can support cost-efficient information extraction from tables. We introduce schema-driven information extraction, a new task that transforms tabular data into structured records following a human-authored schema. To assess the capabilities of various LLMs on this task, we present a benchmark comprising tables from four diverse domains: machine learning papers, chemistry literature, material science journals, and webpages. Using this collection of annotated tables, we evaluate the ability of open-source and API-based language models to extract information from tables spanning these domains and data formats. Our experiments demonstrate that surprisingly competitive performance can be achieved without task-specific pipelines or labels: models reach F1 scores ranging from 74.2 to 96.1 while remaining cost-efficient. Moreover, through detailed ablation studies and analyses, we investigate the factors contributing to model success and validate the practicality of distilling compact models to reduce API reliance.
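To make the task concrete, the sketch below shows one plausible way schema-driven extraction can be prompted: a human-authored schema (attribute names with short descriptions) and a linearized table are combined into a single zero-shot instruction, and the model is asked to emit one JSON record per result cell. This is a minimal illustration, not the paper's exact prompt; the `SCHEMA` contents, the example table, and the `call_llm` helper are all hypothetical placeholders.

```python
import json

# Hypothetical human-authored schema for ML-paper result tables:
# each attribute pairs a name with a short natural-language description.
SCHEMA = {
    "model": "name of the evaluated model or system",
    "dataset": "dataset the result is reported on",
    "metric": "evaluation metric, e.g. F1 or accuracy",
    "value": "numeric score reported in the cell",
}


def build_prompt(schema: dict, table_text: str) -> str:
    """Compose a single zero-shot instruction from the schema and a
    linearized table (no task-specific pipeline or labeled data)."""
    attr_lines = "\n".join(f"- {name}: {desc}" for name, desc in schema.items())
    return (
        "Extract one JSON record per numeric result cell in the table.\n"
        f"Each record must use exactly these attributes:\n{attr_lines}\n"
        "Use null when an attribute is not stated in the table.\n\n"
        f"Table:\n{table_text}\n\nRecords (JSON list):"
    )


def call_llm(prompt: str) -> str:
    """Placeholder for a model call (open-source or API-based);
    expected to return a JSON string. Swap in a real client here."""
    raise NotImplementedError("plug in your LLM client")


def extract_records(table_text: str) -> list:
    """Run the prompt through an LLM and parse the structured output."""
    raw = call_llm(build_prompt(SCHEMA, table_text))
    return json.loads(raw)


if __name__ == "__main__":
    table = "Model | SQuAD F1 | TriviaQA EM\nT5-base | 88.9 | 50.1"
    print(build_prompt(SCHEMA, table))
```

Because the schema travels inside the prompt, adapting the same procedure to a new domain only requires rewriting the attribute descriptions, which is what makes the label-free setting evaluated here plausible.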