Frozen Multimodal Embeddings for Personality and Cognitive Ability Assessment in Asynchronous Video Interviews

Predicting psychological traits from asynchronous video interviews (AVIs) is a challenging multimodal learning problem: labeled data are scarce, yet each response carries high-dimensional visual, acoustic, and verbal signals. This paper presents our solution for the ACM Multimedia AVI Challenge 2026, which comprises two tasks: Track~1 predicts self-reported HEXACO personality traits from personality-related interview responses, and Track~2 classifies cognitive ability levels from structured AVI responses. We treat the problem as a small-sample representation learning task. Instead of fine-tuning large pretrained models, we extract features with frozen multimodal encoders (CLIP for visual features, Whisper for acoustic features and transcripts, and RoBERTa, E5, and DeBERTaV3 for textual representations) and train only low-capacity downstream models on top of them. For Track~1, our trait-specific regression and late-fusion system achieves an average validation MSE of 0.2696, compared with 0.3334 for the official baseline, a 19.1\% relative reduction. Ablation results show that the gain accumulates in three steps: a global model (0.3189), per-trait modeling (0.2871), and per-trait late fusion (0.2696). For Track~2, a compact subject-attribute baseline reaches 0.5781 accuracy, while our multimodal ensemble reaches 0.5313; both exceed the official baseline of 0.4062. Because the subject-attribute baseline outperforms the multimodal ensemble, we interpret the Track~2 result as evidence of possible subject-attribute shortcuts in the validation split rather than of robust cognitive inference from AVI content. Overall, our findings suggest that AVI-based psychological assessment benefits from trait-specific multimodal modeling, but that cognitive ability prediction requires careful control of dataset shortcuts.
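
The per-trait late-fusion idea summarized above can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not state: the downstream model is taken to be closed-form ridge regression, fusion is a uniform average of per-modality predictions, and the embeddings are random stand-ins for precomputed frozen features. The function names, the `alpha` value, and the feature dimensions are hypothetical.

```python
import numpy as np

TRAITS = ["H", "E", "X", "A", "C", "O"]  # the six HEXACO dimensions


def fit_ridge(X, y, alpha=10.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(gram, Xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w


def predict_ridge(model, X):
    w, b = model
    return X @ w + b


def fit_per_trait_late_fusion(features, targets, alpha=10.0):
    """Fit one ridge model per (trait, modality) pair.

    features: dict mapping modality name -> (n, d_m) array of frozen embeddings
    targets:  (n, 6) array of trait scores, columns ordered as TRAITS
    """
    return {
        trait: {m: fit_ridge(X, targets[:, j], alpha) for m, X in features.items()}
        for j, trait in enumerate(TRAITS)
    }


def predict_late_fusion(models, features):
    """Uniformly average the per-modality predictions for each trait."""
    columns = []
    for trait in TRAITS:
        preds = [predict_ridge(models[trait][m], features[m]) for m in features]
        columns.append(np.mean(preds, axis=0))
    return np.stack(columns, axis=1)


if __name__ == "__main__":
    # Synthetic stand-ins for precomputed embeddings (assumed shapes, not real data).
    rng = np.random.default_rng(0)
    dims = {"visual": 32, "acoustic": 24, "text": 48}
    train = {m: rng.normal(size=(200, d)) for m, d in dims.items()}
    val = {m: rng.normal(size=(50, d)) for m, d in dims.items()}
    y_train = rng.normal(loc=3.0, scale=0.5, size=(200, 6))
    y_val = rng.normal(loc=3.0, scale=0.5, size=(50, 6))

    models = fit_per_trait_late_fusion(train, y_train)
    pred = predict_late_fusion(models, val)
    mse_per_trait = ((pred - y_val) ** 2).mean(axis=0)
    print(dict(zip(TRAITS, mse_per_trait.round(4))), mse_per_trait.mean())
```

Because the encoders stay frozen, only the small ridge weights are fit, which keeps the number of trainable parameters low relative to the limited labeled data. Weighted fusion or a different low-capacity regressor would slot into the same structure.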
