This paper explores the system 1 thinking capability of Large Reasoning Models (LRMs), that is, their intuitive ability to respond efficiently with minimal token usage. Existing LRMs rely on long-chain reasoning and excel at complex tasks, but their system 1 thinking ability remains largely underexplored. This capability is essential because it reflects a model's difficulty awareness and reasoning efficiency, both of which are critical for real-world applications. We propose S1-Bench, a multi-domain, multilingual benchmark comprising model-simple system 1 questions. Our investigation of 28 LRMs reveals inadequate accuracy and inefficiency on system 1 problems. We find that existing efficient reasoning methods either generalize poorly to simple questions or sacrifice performance for efficiency. Further exploration shows that LRMs exhibit early difficulty awareness accompanied by lower confidence, and that problem difficulty is implicitly encoded in their hidden states.