In this work, we address question answering (QA) over hybrid tabular and textual data, which is very common on the Web (e.g., SEC filings) and often requires discrete reasoning capabilities. Recently, large language models (LLMs) like GPT-4 have demonstrated strong multi-step reasoning capabilities, so we consider harnessing LLMs to solve our task. We abstract a Step-wise Pipeline for tabular and textual QA, which consists of three key steps: Extractor, Reasoner and Executor. We first design an instruction that instantiates the pipeline and validate that GPT-4, so instructed, outperforms all existing methods. However, relying on an online LLM like GPT-4 poses challenges in terms of cost, latency, and data security risk, which motivates us to specialize smaller LLMs for this task. We develop TAT-LLM, a language model obtained by fine-tuning LLaMA 2 on training data generated automatically from existing expert-annotated datasets following the Step-wise Pipeline. Experimental results show that TAT-LLM outperforms all baseline models, including the previous best fine-tuned models and very large-scale LLMs like GPT-4, on the FinQA, TAT-QA and TAT-DQA benchmarks. We hope our work can serve as a pioneering example of specializing smaller language models for specific tasks.
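As a rough illustration of how the three steps of the Step-wise Pipeline could fit together, the following Python sketch chains a toy Extractor, Reasoner and Executor. It is not the implementation used in this work, where an LLM is instructed or fine-tuned to follow the pipeline. The role given to each step (select evidence, write an equation, evaluate it) is an assumption read off the step names, and the function names, the table `TABLE` and the rule-based logic are all hypothetical.

```python
import ast
import operator

# Hypothetical toy table; real inputs would be a table plus surrounding text.
TABLE = {"revenue_2019": 120.0, "revenue_2018": 100.0}

# Arithmetic operators the toy Executor is allowed to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def extractor(question, table):
    """Toy Extractor: keep cells whose key shares a token with the question."""
    tokens = set(question.lower().replace("?", "").split())
    return {k: v for k, v in table.items() if tokens & set(k.split("_"))}


def reasoner(evidence):
    """Toy Reasoner: write a 'new minus old' equation as a string.

    Assumption: an LLM would generate this equation; here it is hard-coded.
    """
    new, old = evidence["revenue_2019"], evidence["revenue_2018"]
    return f"{new} - {old}"


def executor(expression):
    """Toy Executor: evaluate a plain arithmetic expression without eval()."""

    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")

    return ev(ast.parse(expression, mode="eval").body)


def answer_question(question, table):
    """Chain the three steps: extract evidence, form an equation, execute it."""
    evidence = extractor(question, table)
    equation = reasoner(evidence)
    return executor(equation)


if __name__ == "__main__":
    print(answer_question("What is the change in revenue from 2018 to 2019?", TABLE))
```

Keeping the Executor separate from the Reasoner means the arithmetic is computed deterministically rather than produced as free text, which is one plausible reason for splitting the pipeline this way.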