Large Language Models (LLMs) challenge conventional automated programming assessment because students can now produce functionally correct code without demonstrating corresponding understanding. This paper makes two contributions. First, it reports a saturation-based scoping review of conversational assessment approaches in programming education. The review identifies three dominant architectural families: rule-based or template-driven systems, LLM-based systems, and hybrid systems. Across the literature, conversational agents appear promising for scalable feedback and deeper probing of code understanding, but important limitations remain around hallucinations, over-reliance, privacy, integrity, and deployment constraints. Second, the paper synthesizes these findings into a Hybrid Socratic Framework for integrating conversational verification into Automated Programming Assessment Systems (APASs). The framework combines deterministic code analysis with a dual-agent conversational layer, knowledge tracking, scaffolded questioning, and guardrails that tie prompts to runtime facts. The paper also discusses practical safeguards against LLM-generated explanations, including proctored deployment modes, randomized trace questions, stepwise reasoning tied to concrete execution states, and local-model deployment options for privacy-sensitive settings. Rather than replacing conventional testing, the framework is intended as a complementary layer for verifying whether students understand the code they submit.
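As an illustration of what a randomized trace question tied to a concrete execution state could look like, the following is a minimal sketch and not the framework's actual implementation. It assumes student submissions are plain Python functions; the names `record_states`, `make_trace_question`, `check_answer`, and the sample submission `running_total` are hypothetical and introduced only for this example. The sketch runs the submission on a randomly chosen input, records local variables before each executed line using the standard `sys.settrace` hook, and derives both the question and its expected answer from the recorded run rather than from an LLM.

```python
import random
import sys


def record_states(func, *args):
    """Run func(*args); record (line offset, locals) just before each executed line."""
    states = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace frames of other functions
        if event == "line":
            # Shallow snapshot: assumes values are not mutated later in the run.
            states.append((frame.f_lineno - code.co_firstlineno, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return result, states


def make_trace_question(func, input_pool, rng):
    """Pick a random input, step, and variable; the answer comes from the real run."""
    args = rng.choice(input_pool)
    _, states = record_states(func, *args)
    candidates = [
        (step, line, name, value)
        for step, (line, snapshot) in enumerate(states)
        for name, value in snapshot.items()
    ]
    if not candidates:
        raise ValueError("no variable states were recorded for this input")
    step, line, name, value = rng.choice(candidates)
    prompt = (
        f"Calling {func.__name__}{args!r}: at execution step {step}, just before "
        f"the line at offset {line} from the 'def' line runs, what is the value of '{name}'?"
    )
    return {"prompt": prompt, "expected": value, "args": args}


def check_answer(question, answer_text):
    """Hypothetical grading rule: compare the answer with repr() of the recorded value."""
    return answer_text.strip() == repr(question["expected"])


# Hypothetical student submission used only for illustration.
def running_total(values):
    total = 0
    for v in values:
        total += v
    return total


if __name__ == "__main__":
    question = make_trace_question(running_total, [([1, 2, 3],), ([4, 5],)], random.Random())
    print(question["prompt"])
```

Because the expected answer is read from the recorded execution, a conversational agent built on top of such a helper would only need to phrase the question and any follow-ups; the correctness check itself stays deterministic.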