Benchmark saturation and contamination have obscured genuine advances in reasoning for large language models (LLMs). We introduce the NazoNazo Benchmark, a low-cost, renewable test built from Japanese children's riddles that demand insight-based reasoning, that is, representational shifts rather than knowledge recall. We evaluate 38 frontier LLMs (2023-2025) on 201 riddles and a 120-item human-comparison subset, finding that non-reasoning models average 7.6% accuracy, reasoning models 17.6%, and humans roughly 53%. Importantly, thought-log analysis reveals that reasoning in Japanese did not necessarily improve accuracy, indicating that language understanding alone is insufficient for insight reasoning. Notably, models sometimes generated correct candidates but failed to endorse them, suggesting weak metacognitive control rather than a lack of knowledge. This "verification failure" indicates that CoT outputs can reflect genuine intermediate reasoning states rather than post-hoc rationalizations. By exposing this metacognitive bottleneck, namely models' inability to recognize when they are right, the benchmark provides a scalable, cross-linguistic testbed for studying machine insight, confidence calibration, and self-evaluation. The NazoNazo Benchmark thus offers not only a fresh challenge to current LLMs but also a concrete target for developing AI metacognitive psychology and enhancing machine "Aha!" capability.