Traditional lecture videos offer flexibility but lack mechanisms for real-time clarification, forcing learners to search externally when confusion arises. Recent advances in large language models and neural avatars provide new opportunities for interactive learning, yet existing systems typically lack lecture awareness, rely on cloud-based services, or fail to integrate retrieval and avatar-delivered explanations in a unified, privacy-preserving pipeline. We present ALIVE, an Avatar-Lecture Interactive Video Engine that transforms passive lecture viewing into a dynamic, real-time learning experience. ALIVE operates fully on local hardware and integrates (1) avatar-delivered lectures generated through ASR transcription, LLM refinement, and neural talking-head synthesis; (2) a content-aware retrieval mechanism that combines semantic similarity with timestamp alignment to surface contextually relevant lecture segments; and (3) real-time multimodal interaction, enabling students to pause the lecture, ask questions through text or voice, and receive grounded explanations either as text or as avatar-delivered responses. To maintain responsiveness, ALIVE employs lightweight embedding models, FAISS-based retrieval, and segmented avatar synthesis with progressive preloading. We demonstrate the system on a complete medical imaging course, evaluate its retrieval accuracy, latency characteristics, and user experience, and show that ALIVE provides accurate, content-aware, and engaging real-time support. ALIVE illustrates how multimodal AI, when combined with content-aware retrieval and local deployment, can significantly enhance the pedagogical value of recorded lectures, offering an extensible pathway toward next-generation interactive learning environments.
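The following is a minimal sketch of how the content-aware retrieval described above might be realized, assuming a sentence-transformers model as the lightweight embedder and a FAISS inner-product index; the model name, segment data, and the blending weight and decay used to combine semantic similarity with timestamp proximity are illustrative assumptions, not the system's actual parameters.

```python
# Hypothetical sketch of content-aware retrieval: semantic similarity
# blended with timestamp alignment, using sentence-transformers + FAISS.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # lightweight embedder (assumption)

# Lecture transcript segments with start timestamps in seconds (toy data).
segments = [
    {"start": 12.0,  "text": "CT attenuation is measured in Hounsfield units."},
    {"start": 95.0,  "text": "MRI T1-weighted images show fat as bright."},
    {"start": 310.0, "text": "Windowing maps a Hounsfield range to display grey levels."},
]

# Build a cosine-similarity index (inner product over normalized vectors).
emb = model.encode([s["text"] for s in segments], normalize_embeddings=True)
index = faiss.IndexFlatIP(emb.shape[1])
index.add(np.asarray(emb, dtype="float32"))

def retrieve(question: str, playback_time: float, k: int = 3, alpha: float = 0.8):
    """Blend semantic similarity with proximity to the pause timestamp."""
    q = model.encode([question], normalize_embeddings=True)
    sims, ids = index.search(np.asarray(q, dtype="float32"), k)
    scored = []
    for sim, i in zip(sims[0], ids[0]):
        # Timestamp alignment: segments near the current playback position
        # get a bonus that decays with temporal distance (hypothetical form).
        proximity = 1.0 / (1.0 + abs(segments[i]["start"] - playback_time) / 60.0)
        scored.append((alpha * float(sim) + (1 - alpha) * proximity, segments[i]))
    return sorted(scored, key=lambda t: t[0], reverse=True)

for score, seg in retrieve("What are Hounsfield units?", playback_time=20.0):
    print(f"{score:.3f}  [{seg['start']:>5.0f}s]  {seg['text']}")
```

In this sketch, a question asked near the 20-second mark ranks the nearby Hounsfield-unit segment above the later windowing segment even though both are semantically related, which is the behaviour the timestamp-alignment term is meant to capture.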