While Small Language Models (SLMs) have demonstrated promising performance on an increasingly wide array of commonsense reasoning benchmarks, current evaluation practices rely almost exclusively on the accuracy of their final answers, neglecting the validity of the reasoning processes that lead to those answers. To address this issue, we present ReTraceQA, a novel benchmark that introduces process-level evaluation for commonsense reasoning tasks. Our expert-annotated dataset reveals that in a substantial portion of instances (14-24%), SLMs provide correct final answers despite flawed reasoning processes, suggesting that the capabilities of SLMs are often overestimated by evaluation metrics that focus only on comparing the final answer with the ground truth. Indeed, we show that, when employing strong Large Language Models (LLMs) as automated judges for reasoning-aware evaluation rather than answer-only metrics, SLM performance drops significantly across all models and datasets, with scores decreasing by up to 25%.
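To make the distinction between answer-only and reasoning-aware evaluation concrete, the following is a minimal sketch rather than the benchmark's actual scoring code. It assumes each instance carries a predicted answer, a gold answer, and a boolean verdict on the reasoning trace (from an expert annotator or an LLM judge). The `Instance` class, its field names, and the toy data are hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    predicted: str         # final answer produced by the SLM
    gold: str              # ground-truth answer
    reasoning_valid: bool  # hypothetical verdict on the reasoning trace


def answer_only_accuracy(instances):
    """Fraction of instances whose final answer matches the ground truth."""
    return sum(i.predicted == i.gold for i in instances) / len(instances)


def reasoning_aware_accuracy(instances):
    """Fraction of instances with a correct answer AND a valid reasoning trace."""
    return sum(
        i.predicted == i.gold and i.reasoning_valid for i in instances
    ) / len(instances)


# Toy data (invented for illustration, not taken from ReTraceQA).
toy = [
    Instance("A", "A", True),   # correct answer, valid reasoning
    Instance("B", "B", False),  # correct answer, flawed reasoning
    Instance("C", "D", False),  # wrong answer
    Instance("D", "D", True),   # correct answer, valid reasoning
]

print(answer_only_accuracy(toy))      # 0.75
print(reasoning_aware_accuracy(toy))  # 0.5
```

In this toy set, the second instance is counted as correct by the answer-only metric but not by the reasoning-aware one, which is the kind of gap the abstract describes.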