Talking Slide Avatars: Open-Source Multimodal Communication Approach for Teaching

Slide-based teaching is widely used in higher education, yet in online, hybrid, and asynchronous contexts, slides often lose the instructor presence, narrative continuity, and expressive framing that help learners connect with content. Full lecture video can partly restore these qualities, but it is time-consuming to record, revise, and reuse. This study addresses that pedagogical and production challenge through a practice-based analysis of an open-source workflow for creating talking slide avatars. The workflow integrates OpenVoice for text-to-speech generation and voice cloning with Ditto-TalkingHead for audio-driven talking-image synthesis, enabling instructors to transform a script and a static portrait into a short narrated video that can be embedded in slide decks or HTML-based lecture materials. Rather than treating this workflow merely as a technical solution, the study frames talking slide avatars as multimodal communication artifacts at the intersection of digital pedagogy, aesthetic education, and art-technology practice. Through implementation and analytic reflection, the study documents the production pipeline, examines its communicative and aesthetic affordances, and proposes practical guidelines for script length, image selection, pacing, disclosure, accessibility, and ethical use. The study makes three primary contributions: it presents an educator-oriented open-source production model, reframes talking avatars as an educational communication design problem, and outlines a responsible pathway for incorporating generative synthetic media into teaching. It concludes that short, transparent, and carefully designed avatars can humanize slide-based instruction while providing a reusable communicative layer for introductions, transitions, reminders, and recaps across online, hybrid, and asynchronous learning environments.
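As an illustration of the final embedding step and of the script-length, disclosure, and accessibility guidelines mentioned above, the following is a minimal sketch, not the study's implementation. It assumes the narrated video and a WebVTT caption file have already been produced by the workflow; the function names, the file names, the CSS class, and the 150 words-per-minute speaking rate are all hypothetical choices made for this example, and no OpenVoice or Ditto-TalkingHead calls are shown.

```python
from html import escape


def estimate_duration_seconds(script: str, words_per_minute: int = 150) -> float:
    """Rough narration length, assuming a constant speaking rate.

    The 150 wpm default is an assumption for illustration only.
    """
    word_count = len(script.split())
    return word_count / words_per_minute * 60.0


def build_avatar_embed(video_src: str, captions_src: str, disclosure: str) -> str:
    """Return an HTML5 snippet with a captions track and a visible disclosure.

    All arguments are escaped so they are safe to place in attributes and text.
    """
    return (
        '<figure class="slide-avatar">\n'
        f'  <video controls preload="metadata" src="{escape(video_src, quote=True)}">\n'
        f'    <track kind="captions" srclang="en" label="English" '
        f'src="{escape(captions_src, quote=True)}" default>\n'
        '  </video>\n'
        f'  <figcaption>{escape(disclosure)}</figcaption>\n'
        '</figure>'
    )


if __name__ == "__main__":
    script = "Welcome back. Today we look at colour theory in three short steps."
    print(f"Estimated narration: {estimate_duration_seconds(script):.1f} s")
    # File names below are placeholders, not outputs of any specific tool.
    print(build_avatar_embed(
        "intro_avatar.mp4",
        "intro_avatar.vtt",
        "This clip uses a synthetic voice and an AI-animated portrait.",
    ))
```

Estimating duration before synthesis gives a cheap way to check that a script stays short, and pairing every clip with captions and a visible caption line keeps disclosure and accessibility attached to the artifact itself rather than to the surrounding slide.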
