Automatic personality trait assessment is essential for high-quality human-machine interaction. Systems capable of analyzing human behavior could be applied in self-driving cars, medical research, and surveillance, among many other domains. We present a multimodal deep neural network (DNN) with a Siamese extension for apparent personality trait prediction, trained on short video recordings and exploiting modality-invariant embeddings. The model combines acoustic, visual, and textual information. Because the target distribution of the analyzed dataset is highly concentrated around its center, differences in the third decimal place of the error are meaningful. Our proposed method addresses the challenge of under-represented extreme values, achieves an average improvement of 0.0033 in mean absolute error (MAE), and shows a clear advantage over the baseline multimodal DNN without the introduced module.
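To illustrate why third-decimal differences in MAE can matter when targets are tightly clustered, the following minimal sketch uses purely synthetic scores. The Gaussian shape, the spread of 0.15, the 0.2/0.8 cut-offs for "extreme" values, and the constant-mean reference predictor are all assumptions made for illustration. They are not statistics of the analyzed dataset, and the reference predictor is not the baseline multimodal DNN against which the 0.0033 improvement is reported.

```python
import random
import statistics


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic trait scores in [0, 1], clustered around 0.5.
# ASSUMPTION: the spread (0.15) is illustrative, not taken from the dataset.
targets = [min(1.0, max(0.0, rng.gauss(0.5, 0.15))) for _ in range(10_000)]

# Share of "extreme" scores under hypothetical cut-offs of 0.2 and 0.8.
extreme_share = sum(t < 0.2 or t > 0.8 for t in targets) / len(targets)

# Reference point: a trivial predictor that always outputs the mean score.
mean_score = statistics.fmean(targets)
reference_mae = mean_absolute_error(targets, [mean_score] * len(targets))

# How large is a 0.0033 MAE change relative to that already small error?
improvement = 0.0033
relative_gain = improvement / reference_mae

print(f"share of extreme targets: {extreme_share:.1%}")
print(f"constant-mean MAE:        {reference_mae:.4f}")
print(f"0.0033 relative to it:    {relative_gain:.1%}")
```

Under these assumptions even a predictor that ignores its input has a small MAE, so the range in which models can differ is compressed, and only a small share of samples lies at the extremes where a better model would be expected to gain the most.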