This paper explores the system 1 thinking capability of Large Reasoning Models (LRMs): the intuitive ability to respond efficiently with minimal token usage. While existing LRMs rely on long-chain reasoning and excel at complex tasks, their system 1 thinking ability remains largely underexplored. This capability is essential because it reflects a model's difficulty awareness and reasoning efficiency, both of which are critical for real-world applications. We propose S1-Bench, a multi-domain, multilingual benchmark comprising model-simple system 1 questions. Our investigation of 28 LRMs reveals inadequate accuracy and inefficiency on system 1 problems. We find that existing efficient reasoning methods either generalize poorly to simple questions or sacrifice performance for efficiency. Further exploration shows that LRMs exhibit early difficulty awareness accompanied by lower confidence, and that problem difficulty is implicitly encoded in their hidden states.