This paper proposes a detailed prompting flow, termed Table-Logic, to investigate the performance contrast between larger and smaller language models (LMs) when using step-by-step reasoning methods on the TableQA task. The method processes a task sequentially: it first identifies the critical columns and rows given the question and the table with its structure, then determines the necessary aggregations, calculations, or comparisons, and finally infers over these results to generate a precise prediction. Deploying this method, we observe a 7.8% accuracy improvement for larger LMs such as Llama-3-70B over vanilla prompting on HybridQA, while smaller LMs such as Llama-2-7B show an 11% performance decline. We empirically investigate the potential causes of this contrast by probing the capabilities of larger and smaller LMs along various dimensions of the TableQA task. Our findings highlight the limitations of step-by-step reasoning methods in small models and offer potential insights for improvement.
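To make the sequential flow concrete, the sketch below renders it as one LLM call per step. This is a minimal, hypothetical illustration, not the paper's exact prompts: `query_llm`, `table_logic`, and the prompt wordings are assumptions standing in for any chat-completion client and the actual Table-Logic templates.

```python
def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real chat-completion client."""
    raise NotImplementedError


def table_logic(question: str, table_markdown: str) -> str:
    """Run the step-by-step Table-Logic flow for one TableQA instance."""
    context = f"Table:\n{table_markdown}\nQuestion: {question}\n"

    # Step 1: identify the critical columns given the question and table structure.
    columns = query_llm(context + "Which columns are needed to answer the question?")

    # Step 2: identify the critical rows, conditioned on the selected columns.
    rows = query_llm(
        context
        + f"Relevant columns: {columns}\n"
        + "Which rows are needed to answer the question?"
    )

    # Step 3: determine the necessary aggregations, calculations, or comparisons.
    operations = query_llm(
        context
        + f"Relevant columns: {columns}\nRelevant rows: {rows}\n"
        + "What aggregations, calculations, or comparisons are required?"
    )

    # Step 4: infer over the selected cells and operations to produce the prediction.
    answer = query_llm(
        context
        + f"Columns: {columns}\nRows: {rows}\nOperations: {operations}\n"
        + "Using the information above, give the final answer only."
    )
    return answer
```

Because each step is a separate call conditioned on the previous steps' outputs, an error in column or row selection propagates to the final inference, which is one plausible mechanism for the performance decline observed in smaller LMs.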