Emotion perception and adaptive expression are fundamental capabilities in human-agent interaction. Although recent advances in speech emotion captioning (SEC) have improved fine-grained emotional modeling, existing systems remain limited to static, single-emotion characterization of isolated sentences and neglect dynamic emotional transitions at the discourse level. To address this gap, we propose Emotion Transition-Aware Speech Captioning (EmoTransCap), a paradigm that integrates temporal emotion dynamics with discourse-level speech description. To build a dataset that is rich in emotion transitions and can be expanded at scale, we design an automated dataset-creation pipeline. The resulting dataset is the first large-scale dataset explicitly designed to capture discourse-level emotion transitions. To generate semantically rich descriptions, we incorporate acoustic attributes and temporal cues from discourse-level speech. Our Multi-Task Emotion Transition Recognition (MTETR) model jointly performs emotion transition detection and diarization. Leveraging the semantic analysis capabilities of LLMs, we produce two versions of the annotations: descriptive and instruction-oriented. Together, the data and annotations offer a valuable resource for advancing emotion perception and emotional expressiveness. The dataset enables speech captions that capture emotional transitions, supporting temporally dynamic, fine-grained emotion understanding. We also introduce a controllable, transition-aware emotional speech synthesis system that operates at the discourse level, enhancing human-like emotional expressiveness and supporting emotionally intelligent conversational agents.