Self-assessment is a key aspect of reliable intelligence, yet evaluations of large language models (LLMs) focus mainly on task accuracy. We adapted the 10-item General Self-Efficacy Scale (GSES) to elicit simulated self-assessments from ten LLMs across four conditions: no task, computational reasoning, social reasoning, and summarization. GSES responses were highly stable across repeated administrations and randomized item orders. However, models showed significantly different self-efficacy levels across conditions, with aggregate scores lower than human norms. All models achieved perfect accuracy on the computational and social-reasoning questions, whereas summarization performance varied widely. Self-assessment did not reliably reflect ability: several low-scoring models performed accurately, while some high-scoring models produced weaker summaries. Follow-up confidence prompts yielded modest, mostly downward revisions, suggesting mild overestimation in first-pass assessments. Qualitative analysis showed that higher self-efficacy corresponded to more assertive, anthropomorphic reasoning styles, whereas lower scores reflected cautious, de-anthropomorphized explanations. Psychometric prompting provides structured insight into LLM communication behavior but not calibrated performance estimates.
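To make the administration protocol concrete, below is a minimal Python sketch of how such a GSES elicitation loop might look. Everything model-specific is assumed: `query_model` is a hypothetical stub standing in for whatever chat API is used, the item strings are placeholders for the published 10-item GSES wording, and the per-item prompt phrasing is illustrative. Only the elements stated above and in the standard scale are taken as given: ten items, a 1-4 Likert rating per item (total score 10-40), randomized item order, and repeated administrations to check stability.

```python
import random
import re

# Standard GSES scoring: 10 items, each rated 1 (not at all true)
# to 4 (exactly true); the summed total ranges from 10 to 40.
LIKERT_SCALE = "1 = not at all true, 2 = hardly true, 3 = moderately true, 4 = exactly true"

# Placeholder item texts -- substitute the published 10-item GSES wording.
GSES_ITEMS = [f"<GSES item {i} text>" for i in range(1, 11)]


def query_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call (e.g., a chat-completions endpoint)."""
    raise NotImplementedError("wire this to your model provider")


def administer_gses(task_context: str, seed: int) -> int:
    """One administration: present items in a randomized order and sum the ratings."""
    rng = random.Random(seed)
    order = rng.sample(range(10), k=10)  # randomized item order per administration
    total = 0
    for idx in order:
        prompt = (
            f"{task_context}\n"
            f"Rate how true this statement is of you ({LIKERT_SCALE}). "
            f"Answer with a single digit.\nStatement: {GSES_ITEMS[idx]}"
        )
        reply = query_model(prompt)
        match = re.search(r"[1-4]", reply)  # extract the first valid rating
        if match is None:
            raise ValueError(f"unparseable rating: {reply!r}")
        total += int(match.group())
    return total  # 10-40, comparable to human GSES norms


def repeated_administrations(task_context: str, n_runs: int = 5) -> list[int]:
    """Repeat the scale with fresh item orders to check response stability."""
    return [administer_gses(task_context, seed=run) for run in range(n_runs)]
```

The `task_context` argument corresponds to the four conditions above (an empty string for the no-task condition, or a description of the computational, social-reasoning, or summarization task); comparing score distributions from `repeated_administrations` across contexts is one way to operationalize the stability and between-condition comparisons the abstract reports.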