Empathetic spoken dialogue systems must infer a user's emotional state to respond appropriately, yet everyday speech often carries weak, neutral, or ambiguous affective cues. To address this, we introduce Sympatheia, a speech-to-speech dialogue framework conditioned on affect inferred from the user's speech and, when available, explicit affect specifications provided as a continuous valence--arousal (VA) control signal by a multimodal sensing module or user interface. To train our model, we construct Sympatheia-18k, an emotion-conditioned synthetic spoken dialogue corpus with 12 emotion anchors. This dataset includes an emotional split for learning affective speech behavior, and a neutral split that pairs emotionally neutral queries with multiple emotion-conditioned responses to isolate explicit emotion control in emotionally ambiguous cases. Empirical results show that Sympatheia outperforms speech conversational baselines in generating responses whose semantic content and spoken delivery are both emotionally appropriate. We further show that the same VA interface can integrate emotion estimates from diverse sensing modules, including facial expression, biosignals, and textual affect descriptions, improving response alignment when speech alone provides limited emotional evidence. These results suggest that continuous affect conditioning is an effective practical step for building emotionally adaptive voice assistants.