Technological developments have produced methods that can generate educational videos from input text or speech. Recently, deep learning techniques for image and video generation have been widely explored, particularly in education. However, generating video content from conditional inputs such as text or speech remains challenging. In this paper, we introduce a novel method for educational content based on a Generative Adversarial Network (GAN), which builds a frame-by-frame framework capable of creating complete educational videos. The proposed system is structured into three main phases. In the first phase, the input (either text or speech) is transcribed using speech recognition. In the second phase, key terms are extracted and relevant images are generated using advanced models such as CLIP and diffusion models to improve visual quality and semantic alignment. In the final phase, the generated images are synthesized into a video, integrated with either pre-recorded or synthesized audio, resulting in a fully interactive educational video. The proposed system is compared with other systems such as TGAN, MoCoGAN, and TGANS-C, achieving a Fréchet Inception Distance (FID) of 28.75, which indicates improved visual quality over existing methods.
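The three-phase pipeline described above can be sketched as follows. This is a minimal illustrative outline, not the paper's implementation: every function body is a stub, and the function names, the keyword filter, and the return values are assumptions introduced only to show how the phases connect.

```python
# Hedged sketch of the three-phase pipeline; all bodies are illustrative stubs.

def transcribe(audio_or_text):
    """Phase 1: return a transcript; a real system would run speech
    recognition (ASR) when the input is audio rather than text."""
    if isinstance(audio_or_text, str):
        return audio_or_text              # input is already text
    return "<asr transcript>"             # placeholder for recognized speech

def extract_key_terms(transcript):
    """Phase 2a: naive keyword extraction (placeholder for the paper's
    term-extraction step)."""
    stopwords = {"the", "a", "an", "of", "and", "to", "is"}
    return [w for w in transcript.lower().split() if w not in stopwords]

def generate_images(terms):
    """Phase 2b: stand-in for CLIP-guided diffusion image generation;
    here each term simply maps to a placeholder frame label."""
    return [f"image_for({t})" for t in terms]

def synthesize_video(images, audio):
    """Phase 3: combine generated frames with pre-recorded or synthesized
    audio into a video artifact (stubbed as a dict)."""
    return {"frames": images, "audio": audio}

def text_to_educational_video(input_data, audio="synthesized_narration"):
    """End-to-end pipeline: transcribe, extract terms, generate frames,
    then assemble the final video."""
    transcript = transcribe(input_data)
    terms = extract_key_terms(transcript)
    frames = generate_images(terms)
    return synthesize_video(frames, audio)
```

In a production system each stub would be replaced by a real model (an ASR model for phase 1, CLIP-scored diffusion sampling for phase 2, and a video encoder for phase 3); the sketch only fixes the interfaces between the phases.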