We introduce a novel deep learning-based audio-visual quality (AVQ) prediction model that leverages internal features from state-of-the-art unimodal predictors. Unlike prior approaches that rely on simple fusion strategies, our model employs a hybrid representation that combines learned Generative Machine Listener (GML) audio features with hand-crafted Video Multimethod Assessment Fusion (VMAF) video features. Attention mechanisms capture cross-modal interactions and intra-modal relationships, yielding context-aware quality representations. A modality relevance estimator quantifies each modality's contribution per content, potentially enabling adaptive bitrate allocation. Experiments demonstrate improved AVQ prediction accuracy and robustness across diverse content types.
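The abstract does not specify implementation details, but a minimal sketch can illustrate the described components: projecting GML audio features and VMAF video features into a shared space, applying intra-modal self-attention and cross-modal attention, and weighting each modality with a relevance estimator. All dimensions, layer choices, and the pooling/fusion scheme below are assumptions for illustration, not the authors' actual architecture.

```python
# Hypothetical PyTorch sketch of the described fusion model. Feature
# dimensions, attention configuration, and the fusion head are assumed;
# GML/VMAF feature extraction is outside this snippet's scope.
import torch
import torch.nn as nn

class AVQFusionSketch(nn.Module):
    def __init__(self, audio_dim=128, video_dim=12, d_model=64, n_heads=4):
        super().__init__()
        # Project learned GML audio features and hand-crafted VMAF video
        # features into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        # Intra-modal self-attention, one block per modality.
        self.audio_self = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.video_self = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-modal attention: each modality attends to the other.
        self.a2v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Modality relevance estimator: per-content weights over {audio, video}.
        self.relevance = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, 2)
        )
        self.head = nn.Linear(2 * d_model, 1)  # scalar AVQ score

    def forward(self, audio_feats, video_feats):
        # audio_feats: (B, Ta, audio_dim) internal GML features per audio frame
        # video_feats: (B, Tv, video_dim) VMAF elementary features per video frame
        a = self.audio_proj(audio_feats)
        v = self.video_proj(video_feats)
        a, _ = self.audio_self(a, a, a)      # intra-modal context
        v, _ = self.video_self(v, v, v)
        a_ctx, _ = self.a2v(a, v, v)         # audio queries video
        v_ctx, _ = self.v2a(v, a, a)         # video queries audio
        a_pool = a_ctx.mean(dim=1)           # temporal pooling
        v_pool = v_ctx.mean(dim=1)
        w = torch.softmax(self.relevance(torch.cat([a_pool, v_pool], -1)), -1)
        fused = torch.cat([w[:, :1] * a_pool, w[:, 1:] * v_pool], dim=-1)
        return self.head(fused).squeeze(-1), w  # AVQ score + relevance weights
```

The returned relevance weights `w` are what would make the adaptive-bitrate use case possible: content where `w` skews toward one modality could receive a correspondingly larger share of the bit budget, though the abstract only notes this as a potential application.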