Instruction optimization provides a lightweight, model-agnostic approach to enhancing the reasoning performance of large language models (LLMs). This paper presents the first systematic comparison of instruction optimization for tabular fact verification, built on the DSPy optimization framework. We evaluate four out-of-the-box prompting techniques that cover both text-only prompting and code use: direct prediction, Chain-of-Thought (CoT), ReAct with SQL tools, and CodeAct with Python execution. We study three optimizers from the DSPy framework (COPRO, MiPROv2, and SIMBA) across four benchmarks and three model families. We find that instruction optimization consistently improves verification accuracy: MiPROv2 yields the most stable gains for CoT, while SIMBA provides the largest benefits for ReAct agents, particularly at larger model scales. Behavioral analyses reveal that SIMBA encourages more direct reasoning paths by applying heuristics, which improves numerical comparison in CoT reasoning and helps ReAct agents avoid unnecessary tool calls. Across prompting techniques, CoT remains effective for tabular fact verification, especially with smaller models. Although ReAct agents built on larger models can achieve competitive performance, they require careful instruction optimization.
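For concreteness, the following is a minimal sketch of the kind of pipeline the abstract describes, using DSPy's public API: a CoT verifier over (table, claim) pairs compiled with MIPROv2 (written MiPROv2 above) in instruction-only mode, i.e. with no few-shot demonstrations. The model name, signature fields, metric, and training examples are illustrative assumptions, not the paper's exact setup.

```python
import dspy

# Configure the backbone LM (model name is an illustrative placeholder).
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# A Chain-of-Thought verifier; the signature fields are assumed here,
# not necessarily the paper's exact input/output names.
verifier = dspy.ChainOfThought("table, claim -> verdict")

# Toy training examples; real runs would use a benchmark train split
# (a handful of examples is too small for MIPROv2 in practice).
trainset = [
    dspy.Example(
        table="Player | Points\nA. Smith | 31\nB. Jones | 18",
        claim="A. Smith scored more than 30 points.",
        verdict="SUPPORTED",
    ).with_inputs("table", "claim"),
    dspy.Example(
        table="Player | Points\nA. Smith | 31\nB. Jones | 18",
        claim="B. Jones was the top scorer.",
        verdict="REFUTED",
    ).with_inputs("table", "claim"),
]

# Exact-match metric over the predicted verdict label.
def verdict_match(example, pred, trace=None):
    return example.verdict == pred.verdict

# MIPROv2 proposes and scores candidate instructions; setting both demo
# budgets to zero keeps this a pure instruction-optimization run.
optimizer = dspy.MIPROv2(metric=verdict_match, auto="light")
optimized_verifier = optimizer.compile(
    verifier,
    trainset=trainset,
    max_bootstrapped_demos=0,
    max_labeled_demos=0,
)

# The compiled module is called like the original one.
pred = optimized_verifier(table=trainset[0].table, claim=trainset[0].claim)
print(pred.verdict)
```

The same pattern applies to the other configurations studied: swapping `dspy.ChainOfThought` for `dspy.Predict` gives direct prediction, a `dspy.ReAct` module with a SQL-execution tool gives the ReAct agent, and `dspy.COPRO` or `dspy.SIMBA` can replace `dspy.MIPROv2` as the optimizer.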