Large Language Models (LLMs) are increasingly used by undergraduate students as on-demand tutors, yet their reliability on circuit- and diagram-based digital logic problems remains unclear. We present a human-AI study evaluating three widely used LLMs (GPT, Gemini, and Claude) on 10 undergraduate-level digital logic questions spanning non-standard counters, JK-based state transitions, timing diagrams, frequency division, and finite-state machines. Twenty-four students performed pairwise model comparisons, providing per-question judgments on (i) preferred model, (ii) perceived correctness, (iii) consistency, (iv) verbosity, and (v) confidence, along with global ratings of overall model quality, satisfaction across multiple dimensions (e.g., accuracy and clarity), and the perceived mental effort required to verify answers. To benchmark technical validity, we applied an independent judge-based evaluation against official solutions for all ten questions, using strict correctness criteria. The results reveal a consistent gap between perceived helpfulness and formal correctness: on the most sequentially demanding problems (Q1-Q7), none of the evaluated LLMs matched the official answers, despite producing confident, well-structured explanations that students often rated favorably. Error analysis indicates that the models frequently default to canonical textbook templates (e.g., standard ripple counters) and struggle to translate circuit structure into exact state evolution and timing behavior. These findings suggest that, without verification scaffolds, LLMs may be unreliable for core digital logic topics and can inadvertently reinforce misconceptions in undergraduate instruction.