This article proposes a reasoning-guided vision-language-motion diffusion framework (RG-VLMD) for generating instruction-aware co-speech gestures for humanoid robots in educational scenarios. The system integrates multi-modal affective estimation, pedagogical reasoning, and teaching-act-conditioned motion synthesis to enable adaptive and semantically consistent robot behavior. A gated mixture-of-experts model predicts valence and arousal from text, visual, and acoustic features; these predictions are then mapped to discrete teaching-act categories through an affect-driven policy. These signals condition a diffusion-based motion generator using clip-level intent and frame-level instructional schedules via additive latent restriction with auxiliary action-group supervision. Compared to a baseline diffusion model, the proposed method produces more structured and distinctive motion patterns, as verified by motion statistics and pairwise distance analysis. Generated motion sequences remain physically plausible and can be retargeted to a NAO robot for real-time execution. The results indicate that reasoning-guided instructional conditioning improves gesture controllability and pedagogical expressiveness in educational human-robot interaction.
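To make the affect-to-teaching-act step concrete, the following is a minimal sketch of gated fusion over per-modality valence/arousal predictions, followed by a rule-based mapping to a discrete act. It is an illustration under stated assumptions, not the framework's actual implementation: the function names, the softmax gate over fixed logits, the quadrant thresholds, and the four act labels (`praise`, `encourage`, `calm`, `re-engage`) are all hypothetical, since the abstract does not specify them.

```python
import math
from typing import Dict, List, Tuple

Affect = Tuple[float, float]  # (valence, arousal), assumed to lie in [-1, 1]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of gate logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_affect(expert_preds: Dict[str, Affect],
                gate_logits: Dict[str, float]) -> Affect:
    """Combine per-modality expert predictions with softmax gate weights.

    In a trained model the gate logits would come from a learned network;
    here they are passed in directly (an assumption for illustration).
    """
    names = list(expert_preds)
    weights = softmax([gate_logits[n] for n in names])
    valence = sum(w * expert_preds[n][0] for w, n in zip(weights, names))
    arousal = sum(w * expert_preds[n][1] for w, n in zip(weights, names))
    return valence, arousal


def teaching_act(valence: float, arousal: float) -> str:
    """Hypothetical affect-driven policy: one act per valence/arousal quadrant."""
    if valence >= 0.0:
        return "praise" if arousal >= 0.0 else "encourage"
    return "calm" if arousal >= 0.0 else "re-engage"


if __name__ == "__main__":
    preds = {"text": (0.6, 0.2), "visual": (0.0, 0.4), "acoustic": (-0.3, 0.6)}
    gates = {"text": 0.0, "visual": 0.0, "acoustic": 0.0}  # equal weighting
    v, a = fuse_affect(preds, gates)
    print(teaching_act(v, a))
```

With equal gate logits the fusion reduces to a plain average of the three experts; raising one modality's logit shifts the fused estimate toward that expert, which is the behavior a gated mixture is meant to provide. The resulting act label is the kind of discrete signal that could then condition the motion generator.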