In VR interactions with embodied conversational agents, users' emotional intent is often conveyed more by how something is said than by what is said. However, most VR agent pipelines rely on speech-to-text processing, discarding prosodic cues and often producing emotionally incongruent responses despite correct semantics. We propose an emotion-context-aware VR interaction pipeline that treats vocal emotion as explicit dialogue context in an LLM-based conversational agent. A real-time speech emotion recognition model infers users' emotional states from prosody, and the resulting emotion labels are injected into the agent's dialogue context to shape response tone and style. Results from a within-subjects VR study (N=30) show significant improvements in dialogue quality, naturalness, engagement, rapport, and human-likeness, with 93.3% of participants preferring the emotion-aware agent.
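The core pipeline step, injecting the recognized vocal emotion label into the agent's dialogue context, can be sketched as follows. This is a minimal illustration under assumptions: the SER model, label vocabulary, and chat-message format shown here are hypothetical, not the authors' implementation.

```python
# Illustrative sketch of emotion-context injection for an LLM-based agent.
# The message schema and emotion labels are assumptions for illustration only.

def build_dialogue_context(history, user_utterance, emotion_label):
    """Inject the recognized vocal emotion as explicit context for the LLM,
    so the response tone and style can adapt to the user's emotional state."""
    system_note = (
        f"The user's vocal tone suggests they feel {emotion_label}. "
        "Match your response tone and style to this emotional state."
    )
    return history + [
        {"role": "system", "content": system_note},
        {"role": "user", "content": user_utterance},
    ]

# Hypothetical usage: in the described pipeline, a real-time speech emotion
# recognition model would supply the label from prosody.
context = build_dialogue_context(
    history=[{"role": "system", "content": "You are an embodied VR agent."}],
    user_utterance="I can't believe this happened again.",
    emotion_label="frustrated",  # e.g., output of a prosody-based SER classifier
)
```

In this sketch the emotion label is passed as a system-level note alongside the transcribed utterance, preserving the prosodic signal that speech-to-text alone would discard.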