We explore the efficacy of multimodal behavioral cues for explainable prediction of personality and interview-specific traits. We use elementary head-motion units termed kinemes, atomic facial movements termed action units, and speech features to estimate these human-centered traits. Empirical results confirm that kinemes and action units enable the discovery of multiple trait-specific behaviors and yield explanations that support the predictions. To fuse cues, we explore decision-level and feature-level fusion, as well as an additive attention-based fusion strategy that quantifies the relative importance of the three modalities for trait prediction. Examining various long short-term memory (LSTM) architectures for classification and regression on the MIT Interview and First Impressions Candidate Screening (FICS) datasets, we note that: (1) multimodal approaches outperform their unimodal counterparts; (2) efficient trait predictions and plausible explanations are achieved with both unimodal and multimodal approaches; and (3) following the thin-slice approach, effective trait prediction is achieved even from two-second behavioral snippets.
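To make the idea of additive attention-based fusion concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the architecture, so the scoring function (a generic Bahdanau-style form, score = v · tanh(W x + b)), the embedding sizes, and all names (`additive_attention_fusion`, `W`, `b`, `v`) are assumptions made for illustration. The point it shows is that the softmax weights sum to one, so each weight can be read as the relative importance of one modality (here kinemes, action units, and speech).

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention_fusion(embeddings, W, b, v):
    """Fuse per-modality embeddings with additive attention.

    embeddings: list of M vectors, each of length d (one per modality).
    W: h x d matrix, b: length-h bias, v: length-h vector.
    These parameters are hypothetical stand-ins for learned weights.
    Returns (fused vector of length d, list of M attention weights).
    """
    scores = []
    for x in embeddings:
        # Hidden projection: tanh(W x + b)
        hidden = [
            math.tanh(sum(W[j][i] * x[i] for i in range(len(x))) + b[j])
            for j in range(len(b))
        ]
        # Scalar relevance score for this modality: v . hidden
        scores.append(sum(v[j] * hidden[j] for j in range(len(v))))

    weights = softmax(scores)  # relative importance of each modality
    dim = len(embeddings[0])
    # Weighted sum of the modality embeddings
    fused = [
        sum(w * x[i] for w, x in zip(weights, embeddings)) for i in range(dim)
    ]
    return fused, weights


# Toy usage with random stand-ins for the three modality embeddings.
random.seed(0)
d, h = 8, 4
modalities = ["kinemes", "action_units", "speech"]
embeddings = [[random.gauss(0, 1) for _ in range(d)] for _ in modalities]
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(h)]
b = [0.0] * h
v = [random.gauss(0, 1) for _ in range(h)]

fused, weights = additive_attention_fusion(embeddings, W, b, v)
for name, w in zip(modalities, weights):
    print(f"{name}: {w:.3f}")
```

In a trained model the parameters would be learned jointly with the LSTM encoders, and the per-sample weights could then be inspected to see which modality drove a given trait prediction.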