Deep learning models for multimodal expression recognition have reached remarkable performance in controlled laboratory environments because of their ability to learn complementary and redundant semantic information. However, these models struggle in the wild, mainly because the modalities used for training may be unavailable or of degraded quality. In practice, only a subset of the training-time modalities may be available at test time. Learning with privileged information enables models to exploit data from additional modalities that are only available during training. State-of-the-art knowledge distillation (KD) methods have been proposed to distill information from multiple teacher models (each trained on a modality) to a common student model. These privileged KD methods typically utilize point-to-point matching, yet have no explicit mechanism to capture the structural information in the teacher representation space formed by introducing the privileged modality. Experiments were performed on two challenging problems: pain estimation on the Biovid dataset (ordinal classification) and arousal-valence prediction on the Affwild2 dataset (regression). Results show that our proposed method, PKDOT, can outperform state-of-the-art privileged KD methods on these problems. The diversity among the modalities and fusion architectures considered indicates that PKDOT is modality- and model-agnostic.
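To make the distinction between point-to-point matching and structural information concrete, the following minimal sketch contrasts the two kinds of distillation loss on toy feature vectors. It is an illustration only and not the PKDOT method, whose mechanism is not described in this abstract. The function names, the toy features, and the choice of a pairwise cosine-similarity matrix as a stand-in for "structure" are all assumptions made for this example.

```python
import math


def point_to_point_loss(teacher, student):
    """Mean squared error between each teacher feature and its matched student feature."""
    total, count = 0.0, 0
    for t_vec, s_vec in zip(teacher, student):
        for t, s in zip(t_vec, s_vec):
            total += (t - s) ** 2
            count += 1
    return total / count


def cosine_similarity_matrix(features):
    """Pairwise cosine similarities between all samples in a batch."""
    norms = [math.sqrt(sum(x * x for x in f)) for f in features]
    n = len(features)
    return [
        [
            sum(a * b for a, b in zip(features[i], features[j])) / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


def structural_loss(teacher, student):
    """Mean squared difference between teacher and student similarity matrices.

    Hypothetical stand-in for a structural objective: it compares how samples
    relate to each other, not where each individual sample lies.
    """
    sim_t = cosine_similarity_matrix(teacher)
    sim_s = cosine_similarity_matrix(student)
    n = len(sim_t)
    return sum((sim_t[i][j] - sim_s[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


# Toy batch (assumed values): the student features are the teacher features scaled by 2,
# so the relations between samples are identical even though the points differ.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]

print("point-to-point loss:", point_to_point_loss(teacher, student))
print("structural loss:", structural_loss(teacher, student))
```

In this toy case the point-to-point loss is clearly non-zero, while the structural loss is zero up to floating-point error, because the student preserves the teacher's pairwise relations. The converse can also occur: a student may match individual points reasonably well while distorting the relations between samples, which a purely point-to-point objective does not penalize explicitly.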