Frozen Multimodal Embeddings for AI-Assisted Interview Assessment of Personality and Cognitive Ability

Predicting psychological traits from asynchronous video interviews (AVIs) is a challenging problem in AI-assisted interview assessment: labeled datasets are small, while each response contains high-dimensional visual, acoustic, and verbal signals. This paper presents our solution for the ACM Multimedia AVI Challenge 2026, which evaluates two tasks: Track~1 predicts self-reported HEXACO personality traits from personality-related interview responses, and Track~2 classifies cognitive ability levels from structured AVI responses. We treat the problem as a small-sample representation learning task. Instead of fine-tuning large pretrained models, we use frozen multimodal encoders (CLIP for visual features, Whisper for acoustic features and transcripts, and RoBERTa, E5, and DeBERTaV3 for textual representations), followed by low-capacity downstream models. For Track~1, our trait-specific regression and late-fusion system achieves an average validation MSE of 0.2696, improving over the official baseline of 0.3334. Ablation results show a stepwise improvement from a global model (0.3189), to per-trait modeling (0.2871), to per-trait late fusion (0.2696); the final system corresponds to a 19.1% relative MSE reduction over the official baseline. For Track~2, a compact subject-attribute baseline reaches 0.5781 accuracy, while our multimodal ensemble reaches 0.5313, both above the official baseline of 0.4062. Because the subject-attribute baseline outperforms the multimodal ensemble, we interpret the Track~2 result as evidence of possible subject-attribute shortcuts in the validation split rather than robust cognitive inference from AVI content. Overall, our findings suggest that AVI-based psychological assessment benefits from trait-specific multimodal modeling, but that cognitive ability prediction requires careful control of dataset shortcuts.
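
To make the Track~1 pipeline concrete, the following is a minimal sketch of per-trait late fusion over frozen embeddings, not the authors' implementation. Several choices are assumptions made only for illustration: ridge regression stands in for the "low-capacity downstream models", fusion is a uniform average of per-modality predictions, and random arrays stand in for precomputed CLIP, Whisper, and text-encoder features. The helper names (`fit_ridge`, `late_fusion_per_trait`, `relative_reduction`) are hypothetical. The last line only re-derives the relative MSE reduction from the two figures reported above.

```python
import numpy as np

TRAITS = ["H", "E", "X", "A", "C", "O"]  # the six HEXACO dimensions


def fit_ridge(X, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(A, Xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w


def predict(model, X):
    w, b = model
    return X @ w + b


def late_fusion_per_trait(train_feats, y_train, val_feats, alpha=10.0):
    """Fit one ridge model per (trait, modality), then average per trait.

    train_feats / val_feats: dict mapping modality name -> (n, d) array of
    frozen embeddings (the encoders themselves are never updated).
    y_train: (n, n_traits) array of trait scores.
    Returns an (n_val, n_traits) array of fused predictions.
    """
    n_val = next(iter(val_feats.values())).shape[0]
    fused = np.zeros((n_val, y_train.shape[1]))
    for t in range(y_train.shape[1]):
        preds = [
            predict(fit_ridge(train_feats[m], y_train[:, t], alpha), val_feats[m])
            for m in train_feats
        ]
        # Assumed fusion rule: uniform average over modalities.
        fused[:, t] = np.mean(preds, axis=0)
    return fused


def relative_reduction(baseline_mse, system_mse):
    """Fraction by which system_mse improves on baseline_mse."""
    return (baseline_mse - system_mse) / baseline_mse


# Synthetic stand-ins for precomputed frozen embeddings (assumed sizes).
rng = np.random.default_rng(0)
n_train, n_val = 80, 20
dims = {"visual": 16, "acoustic": 12, "text": 24}
train_feats = {m: rng.normal(size=(n_train, d)) for m, d in dims.items()}
val_feats = {m: rng.normal(size=(n_val, d)) for m, d in dims.items()}
y_train = rng.normal(loc=3.0, scale=0.5, size=(n_train, len(TRAITS)))

fused = late_fusion_per_trait(train_feats, y_train, val_feats)
print(fused.shape)
print(round(relative_reduction(0.3334, 0.2696), 3))
```

Fitting a separate model per trait and per modality keeps each regressor small, which is one plausible way to limit overfitting when the labeled set is small; weighted fusion or a different regressor would slot into the same structure.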
