In this work, we address question answering (QA) over hybrid tabular and textual data, which is very common on the Web (e.g., SEC filings) and often requires discrete reasoning capabilities. Recently, large language models (LLMs) such as GPT-4 have demonstrated strong multi-step reasoning capabilities, and we consider harnessing them to solve this task. We abstract a Step-wise Pipeline for tabular and textual QA that consists of three key steps: Extractor, Reasoner, and Executor. We first design an instruction that instantiates the pipeline and validate that GPT-4 outperforms all existing methods. However, relying on an online LLM such as GPT-4 poses challenges in cost, latency, and data security, which motivates us to specialize smaller LLMs for this task. We develop TAT-LLM by fine-tuning LLaMA 2 on training data generated automatically from existing expert-annotated datasets following the Step-wise Pipeline. Experimental results show that TAT-LLM outperforms all baseline models, including the previous best fine-tuned models and very large-scale LLMs such as GPT-4, on the FinQA, TAT-QA, and TAT-DQA benchmarks.
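To make the three-step structure concrete, the following is a minimal sketch of how an Extractor, Reasoner, and Executor could be chained. The abstract names the steps but does not specify their interfaces, so everything here is an assumption for illustration: the toy table, the function names, and the choice to have the Reasoner emit an arithmetic expression that the Executor evaluates. In particular, the Extractor and Reasoner below are hand-written stand-ins, not language model calls.

```python
import ast
import operator

# Hypothetical toy table; not taken from FinQA, TAT-QA or TAT-DQA.
TABLE = {"revenue_2019": 120.0, "revenue_2018": 100.0}

# Arithmetic operators the toy Executor is allowed to apply.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def extractor(question, table):
    """Step 1 (stand-in): keep table cells whose metric name appears in the question."""
    q = question.lower()
    return {k: v for k, v in table.items() if k.split("_")[0] in q}


def reasoner(evidence):
    """Step 2 (stand-in): write a growth-rate expression over the extracted evidence."""
    new, old = evidence["revenue_2019"], evidence["revenue_2018"]
    return f"({new} - {old}) / {old}"


def executor(expression):
    """Step 3: evaluate the arithmetic expression deterministically, without eval()."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)


def answer(question, table):
    """Run the three steps in order and return the expression with its value."""
    evidence = extractor(question, table)
    expression = reasoner(evidence)
    return expression, executor(expression)


if __name__ == "__main__":
    print(answer("What was the growth rate of revenue from 2018 to 2019?", TABLE))
```

Separating the final computation from the reasoning step, as assumed in this sketch, keeps the arithmetic deterministic regardless of how the expression was produced.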