Simulated Self-Assessment in Large Language Models: A Psychometric Approach to AI Self-Efficacy

Self-assessment is a key aspect of reliable intelligence, yet evaluations of large language models (LLMs) focus mainly on task accuracy. We adapted the 10-item General Self-Efficacy Scale (GSES) to elicit simulated self-assessments from ten LLMs across four conditions: no task, computational reasoning, social reasoning, and summarization. GSES responses were highly stable across repeated administrations and randomized item orders. However, models showed significantly different self-efficacy levels across conditions, with aggregate scores lower than human norms. All models achieved perfect accuracy on computational and social questions, whereas summarization performance varied widely. Self-assessment did not reliably reflect ability: several low-scoring models performed accurately, while some high-scoring models produced weaker summaries. Follow-up confidence prompts yielded modest, mostly downward revisions, suggesting mild overestimation in first-pass assessments. Qualitative analysis showed that higher self-efficacy corresponded to more assertive, anthropomorphic reasoning styles, whereas lower scores reflected cautious, de-anthropomorphized explanations. Psychometric prompting provides structured insight into LLM communication behavior but not calibrated performance estimates.
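The scoring and stability check described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the standard GSES convention of ten items rated 1 to 4 and summed to a total between 10 and 40, which the abstract does not state explicitly. The `rate_item` callable is a hypothetical stand-in for querying a model about one item, and the names `score_gses`, `administer`, and `stability` are invented for this example.

```python
import random
from statistics import mean, pstdev

N_ITEMS = 10                   # the GSES has 10 items
SCALE_MIN, SCALE_MAX = 1, 4    # assumed: standard 4-point response scale


def score_gses(ratings):
    """Sum the item ratings into a total score (10 to 40 under the assumed scale)."""
    if len(ratings) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} ratings, got {len(ratings)}")
    if any(not (SCALE_MIN <= r <= SCALE_MAX) for r in ratings):
        raise ValueError("each rating must lie between 1 and 4")
    return sum(ratings)


def administer(rate_item, rng):
    """Run one administration with the items presented in random order.

    rate_item is a hypothetical callable mapping an item index (0-9) to a
    rating; in a real study it would wrap a model query.
    """
    order = list(range(N_ITEMS))
    rng.shuffle(order)                 # randomize presentation order
    ratings = [0] * N_ITEMS
    for idx in order:
        ratings[idx] = rate_item(idx)  # store under the item's original index
    return score_gses(ratings)


def stability(rate_item, runs=5, seed=0):
    """Repeat the administration and return (mean total, population std dev)."""
    rng = random.Random(seed)
    totals = [administer(rate_item, rng) for _ in range(runs)]
    return mean(totals), pstdev(totals)
```

A standard deviation near zero across runs would correspond to the kind of stability the abstract reports, although the abstract does not say which stability statistic was actually used.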
