In this study, we present a multimodal framework for predicting neuro-facial disorders by capturing both vocal and facial cues. We hypothesize that explicitly disentangling shared and modality-specific representations within multimodal foundation model embeddings can enhance clinical interpretability and generalization. To validate this hypothesis, we propose DIVINE, a fully disentangled multimodal framework that operates on representations extracted from state-of-the-art (SOTA) audio and video foundation models, incorporating hierarchical variational bottlenecks, sparse gated fusion, and learnable symptom tokens. DIVINE operates in a multitask learning setup to jointly predict diagnostic categories (Healthy Control, ALS, Stroke) and severity levels (Mild, Moderate, Severe). The model is trained on synchronized audio and video inputs and evaluated on the Toronto NeuroFace dataset under full (audio-video) as well as single-modality (audio-only and video-only) test conditions. DIVINE achieves SOTA results, with the DeepSeek-VL2 and TRILLsson combination reaching 98.26% accuracy and a 97.51% F1-score. Under modality-constrained scenarios, the framework generalizes well to video-only and audio-only inputs, consistently outperforming unimodal models and baseline fusion techniques. To the best of our knowledge, DIVINE is the first framework to combine cross-modal disentanglement, adaptive fusion, and multitask learning for comprehensive assessment of neurological disorders from synchronized speech and facial video.
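To make the architecture concrete, the following is a minimal PyTorch sketch, not the released implementation, of the components named above: shared versus modality-specific variational bottlenecks, a gated fusion over the resulting latents (a plain softmax gate stands in for whatever sparsification the full model uses), learnable symptom tokens that attend over the fused latents, and the two task heads for diagnosis and severity. All module names, dimensions, and wiring here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalBottleneck(nn.Module):
    """Maps an embedding to a Gaussian latent; returns a sample and its KL term."""
    def __init__(self, in_dim, z_dim):
        super().__init__()
        self.mu = nn.Linear(in_dim, z_dim)
        self.logvar = nn.Linear(in_dim, z_dim)

    def forward(self, x):
        mu, logvar = self.mu(x), self.logvar(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl

class DivineSketch(nn.Module):
    # Dimensions and token count are placeholders, not values from the paper.
    def __init__(self, audio_dim=1024, video_dim=1024, z_dim=256,
                 n_symptom_tokens=8, n_diag=3, n_sev=3):
        super().__init__()
        # Shared bottlenecks, one per modality (the cross-modal alignment loss
        # that ties the two shared spaces together is omitted here).
        self.shared_a = VariationalBottleneck(audio_dim, z_dim)
        self.shared_v = VariationalBottleneck(video_dim, z_dim)
        # Modality-specific (private) bottlenecks.
        self.private_a = VariationalBottleneck(audio_dim, z_dim)
        self.private_v = VariationalBottleneck(video_dim, z_dim)
        # Gated fusion over the four latent branches; softmax is a soft stand-in
        # for a sparse gate (e.g., top-k or entmax).
        self.gate = nn.Linear(4 * z_dim, 4)
        # Learnable symptom tokens act as attention queries over the fused latents.
        self.symptom_tokens = nn.Parameter(torch.randn(n_symptom_tokens, z_dim))
        self.attn = nn.MultiheadAttention(z_dim, num_heads=4, batch_first=True)
        # Multitask heads: diagnosis (HC / ALS / Stroke) and severity.
        self.diag_head = nn.Linear(z_dim, n_diag)
        self.sev_head = nn.Linear(z_dim, n_sev)

    def forward(self, audio_emb, video_emb):
        # audio_emb: (B, audio_dim) pooled foundation-model features; same for video.
        za_s, kl1 = self.shared_a(audio_emb)
        zv_s, kl2 = self.shared_v(video_emb)
        za_p, kl3 = self.private_a(audio_emb)
        zv_p, kl4 = self.private_v(video_emb)
        branches = torch.stack([za_s, zv_s, za_p, zv_p], dim=1)    # (B, 4, z)
        gates = F.softmax(self.gate(branches.flatten(1)), dim=-1)  # (B, 4)
        fused = gates.unsqueeze(-1) * branches                     # gated branches
        # Symptom tokens query the four gated latents.
        q = self.symptom_tokens.unsqueeze(0).expand(fused.size(0), -1, -1)
        sym, _ = self.attn(q, fused, fused)                        # (B, T, z)
        pooled = sym.mean(dim=1)
        kl = kl1 + kl2 + kl3 + kl4  # added to the task losses during training
        return self.diag_head(pooled), self.sev_head(pooled), kl

# Dummy batches standing in for TRILLsson (audio) / DeepSeek-VL2 (video) embeddings:
model = DivineSketch()
diag_logits, sev_logits, kl = model(torch.randn(2, 1024), torch.randn(2, 1024))
```

Dropping either modality's branches at test time (zeroing its gates) is one plausible way such a design supports the audio-only and video-only conditions, since the private latents keep modality-specific evidence separate from the shared space.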