In this paper, we explore whether large language models (LLMs) can support cost-efficient information extraction from complex tables. We introduce schema-driven information extraction, a new task in which LLMs transform tabular data into structured records that follow a human-authored schema. To assess the capabilities of various LLMs on this task, we develop a benchmark composed of tables from three diverse domains: machine learning papers, chemistry tables, and webpages. Alongside the benchmark, we present InstrucTE, a table extraction method based on instruction-tuned LLMs. The method requires only a human-constructed extraction schema and incorporates an error-recovery strategy. Notably, InstrucTE achieves competitive performance without task-specific labels, with F1 scores ranging from 72.3 to 95.7. We also validate the feasibility of distilling more compact table extraction models to reduce extraction costs and API reliance. This study paves the way for the future development of instruction-following models for cost-efficient table extraction.
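To make the idea of schema-driven extraction with error recovery more concrete, the following is a minimal sketch and not the InstrucTE implementation. The schema fields, the `validate` and `extract_record` helpers, the retry prompt wording, and the stubbed `llm` callable are all hypothetical and chosen only for illustration. The sketch assumes the model is asked to return one JSON object per request and that rejected output is fed back to the model as a hint.

```python
import json
from typing import Callable, Dict, List, Optional

# Hypothetical human-authored schema: attribute name -> expected Python type.
SCHEMA: Dict[str, type] = {"model": str, "dataset": str, "metric": str, "value": float}


def validate(record: dict, schema: Dict[str, type]) -> List[str]:
    """Return a list of problems; an empty list means the record fits the schema."""
    problems = [f"missing attribute: {key}" for key in schema if key not in record]
    problems += [f"unexpected attribute: {key}" for key in record if key not in schema]
    for key, expected in schema.items():
        if key in record and not isinstance(record[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    return problems


def extract_record(
    base_prompt: str,
    llm: Callable[[str], str],
    schema: Dict[str, type],
    max_retries: int = 2,
) -> Optional[dict]:
    """Ask the model for a JSON record; on bad output, retry with the error as a hint."""
    prompt = base_prompt
    for _ in range(max_retries + 1):
        raw = llm(prompt)
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            feedback = f"invalid JSON ({exc.msg})"
        else:
            if isinstance(record, dict):
                problems = validate(record, schema)
            else:
                problems = ["output is not a JSON object"]
            if not problems:
                return record
            feedback = "; ".join(problems)
        # Error recovery (assumed strategy): re-prompt with the rejection reason.
        prompt = (
            f"{base_prompt}\n\nPrevious output was rejected: {feedback}. "
            "Return only a JSON object."
        )
    return None  # Give up after the retry budget is spent.


def make_stub_llm(responses: List[str]) -> Callable[[str], str]:
    """Stand-in for a real model call: replays canned responses in order."""
    replies = iter(responses)
    return lambda prompt: next(replies)
```

In this sketch, a stub that first returns truncated JSON and then a well-formed object yields a valid record on the second attempt, while a stub that never produces valid JSON causes `extract_record` to return `None` once the retries are exhausted. A real system would replace the stub with an actual model call and would likely need a richer schema than a flat name-to-type mapping.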