While Small Language Models (SLMs) have demonstrated promising performance on an increasingly wide array of commonsense reasoning benchmarks, current evaluation practices rely almost exclusively on the accuracy of their final answers, neglecting the validity of the reasoning processes that lead to those answers. To address this issue, we introduce ReTraceQA, a novel benchmark that brings process-level evaluation to commonsense reasoning tasks. Our expert-annotated dataset reveals that in a substantial portion of instances (14-24%), SLMs provide correct final answers despite flawed reasoning processes, suggesting that the capabilities of SLMs are often overestimated by evaluation metrics that only compare the final answer against the ground truth. Indeed, we show that when strong Large Language Models (LLMs) are employed as automated judges for reasoning-aware evaluation rather than answer-only metrics, SLM performance drops significantly across all models and datasets, with scores decreasing by up to 25%.
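To make the contrast concrete, the following is a minimal, hypothetical sketch of the two evaluation modes discussed above: answer-only scoring versus reasoning-aware scoring with an LLM judge. The judge prompt and the `call_llm` wrapper are illustrative assumptions, not the exact ReTraceQA protocol.

```python
# Minimal sketch (not the paper's exact protocol) contrasting answer-only accuracy
# with reasoning-aware evaluation, where an LLM judge must also validate the
# model's reasoning trace. `call_llm` and the judge prompt are hypothetical.

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM API; returns the model's text reply."""
    raise NotImplementedError


def answer_only_score(prediction: str, gold: str) -> int:
    # Standard metric: 1 if the final answer matches the ground truth, else 0.
    return int(prediction.strip().lower() == gold.strip().lower())


def reasoning_aware_score(question: str, trace: str, prediction: str, gold: str) -> int:
    # Reasoning-aware metric: the final answer must be correct AND an LLM judge
    # must find no flaw in the step-by-step reasoning that produced it.
    if answer_only_score(prediction, gold) == 0:
        return 0
    verdict = call_llm(
        "You are grading a model's reasoning.\n"
        f"Question: {question}\n"
        f"Reasoning trace: {trace}\n"
        f"Final answer: {prediction}\n"
        "Is every step of the reasoning valid and does it support the answer? "
        "Reply with exactly VALID or FLAWED."
    )
    return int(verdict.strip().upper().startswith("VALID"))
```

Under such a scheme, an instance with a correct answer but a flawed trace counts toward answer-only accuracy yet receives no credit from the reasoning-aware metric, which is precisely the 14-24% gap the benchmark is designed to expose.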