The reasoning capabilities of Large Language Models (LLMs) have led to their increasing adoption in several critical applications, particularly education, where they support problem-solving, tutoring, and personalized study. Chain-of-thought (CoT) reasoning [1, 2] is well known to help LLMs decompose a problem into steps and explore the solution space more effectively, leading to impressive performance on mathematical and reasoning benchmarks. However, as CoT length grows substantially, to even thousands of tokens per question [1], it remains unclear how users can comprehend LLM reasoning and detect errors or hallucinations within it. To address this problem and understand how reasoning can improve human-AI interaction, we present three new interactive reasoning interfaces: interactive CoT (iCoT), interactive Program-of-Thought (iPoT), and interactive Graph (iGraph). That is, we ask LLMs themselves to generate an interactive web interface wrapped around the original CoT content, which may be presented as text (iCoT), graphs (iGraph), or code (iPoT). This interface allows users to interact with the LLM's reasoning chains and provides a novel experience in reading and validating them. In a study of 125 participants, interactive interfaces significantly improve user performance. Specifically, iGraph users achieve the highest error-detection rate (85.6%), followed by iPoT (82.5%) and iCoT (80.6%), all outperforming standard CoT (73.5%). Interactive interfaces also lead to faster user validation time: iGraph users are faster (57.9 secs per question) than users of iCoT and iPoT (60 secs) and standard CoT (64.7 secs). A post-study questionnaire shows that users prefer iGraph, citing its superior ability to help them follow the LLM's reasoning. We discuss the implications of these results and provide recommendations for the future design of reasoning models.