This paper proposes a multi-agent artificial intelligence system that generates response-oriented media content in real time based on audio-derived emotional signals. Unlike conventional speech emotion recognition studies that focus primarily on classification accuracy, our approach emphasizes the transformation of inferred emotional states into safe, age-appropriate, and controllable response content through a structured pipeline of specialized AI agents. The proposed system comprises four cooperative agents: (1) an Emotion Recognition Agent with CNN-based acoustic feature extraction, (2) a Response Policy Decision Agent for mapping emotions to response modes, (3) a Content Parameter Generation Agent for producing media control parameters, and (4) a Safety Verification Agent enforcing age-appropriateness and stimulation constraints. We introduce an explicit safety verification loop that filters generated content before output, ensuring compliance with predefined rules. Experimental results on public datasets demonstrate that the system achieves 73.2% emotion recognition accuracy, 89.4% response mode consistency, and 100% safety compliance while maintaining sub-100ms inference latency suitable for on-device deployment. The modular architecture enables interpretability and extensibility, making it applicable to child-adjacent media, therapeutic applications, and emotionally responsive smart devices.
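The four-agent pipeline and its safety verification loop can be sketched as follows. This is a minimal illustrative sketch only: every class, function, parameter name, and threshold below is hypothetical, since the abstract does not specify an API, and the CNN-based recognizer is stubbed with a fixed label.

```python
# Hypothetical sketch of the four-agent pipeline from the abstract.
# All names, presets, and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ContentParams:
    tempo: float       # hypothetical media control parameter, 0..1
    brightness: float  # hypothetical stimulation level, 0..1

def recognize_emotion(audio_features):
    # Agent 1 (Emotion Recognition): a CNN classifier in the paper;
    # stubbed here with a fixed label for illustration.
    return "sad"

def decide_response_mode(emotion):
    # Agent 2 (Response Policy Decision): map emotion to a response mode.
    return {"sad": "soothe", "happy": "engage"}.get(emotion, "neutral")

def generate_params(mode):
    # Agent 3 (Content Parameter Generation): media control parameters.
    presets = {
        "soothe":  ContentParams(tempo=0.4, brightness=0.3),
        "engage":  ContentParams(tempo=0.9, brightness=0.8),
        "neutral": ContentParams(tempo=0.6, brightness=0.5),
    }
    return presets[mode]

def verify_safety(params, max_stimulation=0.7):
    # Agent 4 (Safety Verification): enforce stimulation constraints
    # before any content is output.
    return params.brightness <= max_stimulation

def respond(audio_features, max_attempts=3):
    # Safety verification loop: filter generated content, attenuating
    # and retrying until it passes, else fall back to a safe preset.
    mode = decide_response_mode(recognize_emotion(audio_features))
    params = generate_params(mode)
    for _ in range(max_attempts):
        if verify_safety(params):
            return params
        params = ContentParams(params.tempo * 0.8, params.brightness * 0.8)
    return generate_params("neutral")
```

The explicit loop in `respond` mirrors the abstract's claim that content is filtered before output: an unsafe parameter set is never returned, which is what enables the reported 100% safety compliance by construction.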