In this paper, we explore whether large language models (LLMs) can support cost-efficient information extraction from tables. We introduce schema-driven information extraction (IE), a new task that transforms tabular data into structured records following a human-authored schema. To assess the capabilities of various LLMs on this task, we develop a benchmark composed of tables from four diverse domains: machine learning papers, chemistry literature, materials science journals, and webpages. Alongside the benchmark, we present an extraction method based on instruction-tuned LLMs. Our approach shows competitive performance without task-specific labels, achieving F1 scores ranging from 74.2 to 96.1 while remaining cost-efficient. Moreover, we validate the feasibility of distilling compact table-extraction models to reduce API reliance, as well as of extracting from image tables using multi-modal models. By developing a benchmark and demonstrating the feasibility of this task using proprietary models, we aim to support future work on open-source schema-driven IE models.
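To make the task definition concrete, the following is a minimal sketch of what "structured records following a human-authored schema" could look like in practice. It is an illustration only, not the extraction method described above: the schema `RESULT_SCHEMA`, its attribute names, and the helpers `build_prompt` and `parse_record` are all hypothetical, and the model call itself is left out.

```python
import json

# Hypothetical human-authored schema for a results table in an ML paper:
# attribute name -> expected Python type. Not taken from the benchmark.
RESULT_SCHEMA = {"task": str, "dataset": str, "metric": str, "value": float}


def build_prompt(table_text, schema):
    """Build an instruction asking a model to emit one JSON record per table cell.

    The wording is an assumption made for illustration.
    """
    attributes = ", ".join(
        f'"{name}" ({typ.__name__})' for name, typ in schema.items()
    )
    return (
        "Extract a JSON record for each numeric cell in the table below.\n"
        f"Each record must contain exactly these attributes: {attributes}.\n"
        "Use null for attributes that the table does not specify.\n\n"
        f"Table:\n{table_text}\n"
    )


def parse_record(raw_json, schema):
    """Parse one model-produced JSON object and coerce it to the schema.

    Attributes outside the schema are dropped; missing ones become None.
    """
    record = json.loads(raw_json)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    parsed = {}
    for name, typ in schema.items():
        value = record.get(name)
        parsed[name] = None if value is None else typ(value)
    return parsed
```

Under these assumptions, a model response such as `{"task": "NER", "dataset": "CoNLL", "metric": "F1", "value": "93.5"}` would be coerced by `parse_record` into a record whose `value` field is a float, so downstream scoring can compare predicted and gold records attribute by attribute.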