Improving Human Verification of LLM Reasoning through Interactive Explanation Interfaces

The reasoning capabilities of Large Language Models (LLMs) have led to their increasing use in several critical applications, particularly education, where they support problem-solving, tutoring, and personalized study. Chain-of-thought (CoT) reasoning [1, 2] is well known to help LLMs decompose a problem into steps and explore the solution space more effectively, leading to impressive performance on mathematical and reasoning benchmarks. As CoT length grows substantially, reaching thousands of tokens per question [1], it remains unclear how users can comprehend LLM reasoning and detect errors or hallucinations. To address this problem and understand how reasoning can improve human-AI interaction, we present three new interactive reasoning interfaces: interactive CoT (iCoT), interactive Program-of-Thought (iPoT), and interactive Graph (iGraph). That is, we ask LLMs themselves to generate an interactive web interface wrapped around the original CoT content, which may be presented as text (iCoT), graphs (iGraph), or code (iPoT). These interfaces let users interact with the reasoning chains of LLMs and offer a new way to read and validate them. In a study with 125 participants, the interactive interfaces significantly improve user performance. Specifically, iGraph users achieve the highest error detection rate (85.6%), followed by iPoT (82.5%) and iCoT (80.6%) users, all outperforming standard CoT users (73.5%). The interactive interfaces also shorten validation time: iGraph users are the fastest (57.9 seconds per question), ahead of iCoT and iPoT users (60 seconds) and standard CoT users (64.7 seconds). A post-study questionnaire shows that users prefer iGraph, citing its superior ability to help them follow the LLM's reasoning. We discuss the implications of these results and provide recommendations for the future design of reasoning models.
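To make the size of the reported effects easier to read, the following minimal sketch computes each interactive interface's gain over the standard CoT baseline, using only the figures quoted in the abstract. The names `RESULTS` and `gains_over_baseline` are illustrative and not part of the study, and the 60-second figure for iCoT and iPoT is taken as reported (it may be rounded).

```python
# Figures quoted in the abstract:
# interface -> (error detection rate in %, mean validation time in seconds).
# The 60.0 s value for iCoT and iPoT is used as reported and may be rounded.
RESULTS = {
    "CoT": (73.5, 64.7),
    "iCoT": (80.6, 60.0),
    "iPoT": (82.5, 60.0),
    "iGraph": (85.6, 57.9),
}


def gains_over_baseline(results, baseline="CoT"):
    """Return {interface: (percentage-point gain, seconds saved)} vs. the baseline."""
    base_rate, base_time = results[baseline]
    return {
        name: (round(rate - base_rate, 1), round(base_time - secs, 1))
        for name, (rate, secs) in results.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, (points, seconds) in gains_over_baseline(RESULTS).items():
        print(f"{name}: +{points} points, {seconds} s faster per question")
```

The gains are differences in percentage points, not relative improvements, and they say nothing about statistical significance, which the study assesses separately.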
