Large language model (LLM) interfaces are increasingly verbose, exposing intermediate reasoning traces alongside final answers. These traces are framed as transparency mechanisms, yet it is unclear how people use them to solve problems. We report a preregistered between-subjects study (N = 559) in which participants solved ten LSAT-style reasoning problems under one of three conditions: an Answer-only baseline, a Full-trace revealed before the answer, and a Summary-trace presented alongside the answer. Summary traces kept task performance at the level of the answer-only baseline while significantly elevating trust and hedonic appeal, establishing that trace exposure shifts subjective appraisal of the interaction without improving performance. Under an open-weight reasoning model exposing verbose intermediate output, full traces additionally impaired performance relative to the answer-only baseline. Across all conditions, participants substantially overestimated their performance, and no trace format supported calibrated self-evaluation. Further analysis indicates that hedonic appeal, not trust, carries the indirect path to overestimation, consistent with a processing-fluency account. Reasoning traces are best understood as user-facing interface artifacts rather than transparent windows into model cognition. Calibration is unlikely to emerge from the traces themselves and may best be scaffolded by interactions that elicit users' own reasoning first.