Audio-Visual Speech Enhancement: Architectural Design and Deployment Strategies

Real-time audio-visual speech enhancement (AVSE) is a key enabler for immersive and interactive multimedia services, yet its performance is tightly constrained by network latency, uplink capacity, and computational delay. This paper presents the design, deployment, and evaluation of a complete cloud-edge-assisted AVSE system operating over a public 5G edge network. The system integrates CNN-based acoustic enhancement and OpenCV-based facial feature extraction with an LSTM fusion network to preserve temporal coherence, and is deployed on a Vodafone-compatible AWS Wavelength edge cloud. Through extensive stress testing, we analyze end-to-end performance under varying network load and adaptive multimedia profiles. Results show that compute placement at the network edge is critical for meeting real-time coherence constraints, and that uplink capacity is often the dominant bottleneck for interactive AVSE services. Only 5G and wired Ethernet consistently satisfied the required communication delay bound for uncompressed audio-video chunks, while aggressive compression reduced payload sizes by up to 80% with negligible perceptual degradation, enabling robust operation under constrained conditions. We further demonstrate a fundamental trade-off between processing latency and enhancement quality, where reduced model complexity lowers delay but degrades reconstruction performance in low-SNR scenarios. Our findings indicate that public 5G edge environments can sustain real-time, interactive AVSE workloads when network and compute resources are carefully orchestrated, although performance margins remain tighter than in dedicated infrastructures. The architectural insights derived from this study provide practical guidelines for the design of delay-sensitive multimedia and perceptual enhancement services on emerging 5G edge-cloud platforms.
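The interplay between uplink capacity, payload size, and the delay bound described above can be illustrated with a back-of-the-envelope calculation. The following is a minimal sketch, not the system's actual measurement code: the chunk size, uplink rate, propagation delay, processing delay, and delay budget are all hypothetical values chosen for illustration, and only the 80% reduction figure comes from the text (where it is an upper bound).

```python
def transmission_delay_ms(payload_bytes: int, uplink_mbps: float) -> float:
    """Time to serialize one chunk onto the uplink, in milliseconds."""
    if uplink_mbps <= 0:
        raise ValueError("uplink_mbps must be positive")
    bits = payload_bytes * 8
    return bits * 1000 / (uplink_mbps * 1_000_000)


def total_delay_ms(payload_bytes: int, uplink_mbps: float,
                   propagation_ms: float, processing_ms: float) -> float:
    """Simplified one-way delay: uplink serialization + propagation + compute."""
    return (transmission_delay_ms(payload_bytes, uplink_mbps)
            + propagation_ms + processing_ms)


# All values below are assumptions for illustration, not measured results.
RAW_CHUNK_BYTES = 500_000   # hypothetical uncompressed audio-video chunk
REDUCTION_PERCENT = 80      # upper figure quoted in the abstract
UPLINK_MBPS = 20.0          # hypothetical uplink capacity
PROPAGATION_MS = 10.0       # hypothetical device-to-edge propagation delay
PROCESSING_MS = 30.0        # hypothetical enhancement (inference) time
BUDGET_MS = 100.0           # hypothetical end-to-end delay bound

# Integer arithmetic avoids floating-point rounding in the size calculation.
compressed_chunk_bytes = RAW_CHUNK_BYTES * (100 - REDUCTION_PERCENT) // 100

raw_delay = total_delay_ms(RAW_CHUNK_BYTES, UPLINK_MBPS,
                           PROPAGATION_MS, PROCESSING_MS)
compressed_delay = total_delay_ms(compressed_chunk_bytes, UPLINK_MBPS,
                                  PROPAGATION_MS, PROCESSING_MS)

print(f"raw chunk:        {raw_delay:.1f} ms, within budget: {raw_delay <= BUDGET_MS}")
print(f"compressed chunk: {compressed_delay:.1f} ms, within budget: {compressed_delay <= BUDGET_MS}")
```

Under these assumed numbers, the serialization term dominates for the uncompressed chunk, which is consistent with the observation that uplink capacity tends to be the limiting factor; shrinking the payload reduces only that term, while propagation and processing delay are unaffected.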
