Large reasoning models (LRMs) have achieved strong performance gains by scaling test-time computation, but because of the inherent limitations of the underlying language models, they still fall short on tasks that require precise computation or extensive knowledge. Tool-Integrated Reasoning (TIR) has emerged as a promising paradigm that incorporates tool calls and their execution into the reasoning trajectory. Although recent work has released some powerful open-source TIR models, our analysis reveals that these models still suffer from critical deficiencies. When a model's reasoning conflicts with the tool results, the model tends to trust its own reasoning. In some cases the tool results are correct but the model ignores them and produces an incorrect answer, a failure we define as "Tool Ignored". This indicates that the model does not know when to trust and when to ignore the tool. To overcome these limitations, we introduce Adaptive Tool Trust Calibration (ATTC), a novel framework that guides the model to adaptively trust or ignore the tool results based on the confidence score of the generated code blocks. Experiments on open-source TIR models of different sizes and across multiple datasets demonstrate that ATTC effectively reduces the "Tool Ignored" issue, resulting in a performance increase of 4.1% to 7.5%.
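To make the idea of confidence-based trust calibration concrete, the following is a minimal sketch and not the ATTC implementation. The abstract does not specify how the confidence score is computed or how the decision is made, so everything here is an assumption for illustration: confidence is taken to be the geometric-mean token probability of the generated code block, the decision is a fixed threshold, and the names `ToolCall`, `code_confidence`, `should_trust_tool`, and the value `0.9` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ToolCall:
    """One tool invocation inside a reasoning trajectory (hypothetical structure)."""
    code: str                         # code block generated by the model
    token_logprobs: Sequence[float]   # log-probability of each generated code token
    tool_output: str                  # result returned by executing the code


def code_confidence(token_logprobs: Sequence[float]) -> float:
    """Assumed confidence score: geometric-mean token probability of the code block."""
    if not token_logprobs:
        return 0.0  # no tokens, no evidence: treat as zero confidence
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_trust_tool(call: ToolCall, threshold: float = 0.9) -> bool:
    """Trust the tool output when the code was generated with high confidence.

    The threshold is an illustrative value, not one reported in the paper.
    """
    return code_confidence(call.token_logprobs) >= threshold


if __name__ == "__main__":
    confident = ToolCall("print(17 * 23)", [-0.01, -0.02, -0.01], "391")
    uncertain = ToolCall("print(17 * 32)", [-0.7, -0.9, -0.6], "544")
    for call in (confident, uncertain):
        action = "trust" if should_trust_tool(call) else "ignore"
        print(f"{call.code!r}: {action} tool output {call.tool_output!r}")
```

Under these assumptions, a code block written with high token-level certainty leads the model to adopt the tool result, while a low-confidence block leaves the model free to fall back on its own reasoning.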