Audio-Visual Deception Detection: DOLOS Dataset and Parameter-Efficient Crossmodal Learning

Deception detection in conversations is a challenging yet important task, with pivotal applications in fields such as credibility assessment in business, multimedia anti-fraud, and customs security. Despite this, deception detection research is hindered by the lack of high-quality deception datasets and by the difficulty of learning multimodal features effectively. To address these issues, we introduce DOLOS\footnote{The name ``DOLOS'' comes from Greek mythology.}, the largest gameshow deception detection dataset, with rich deceptive conversations. DOLOS includes 1,675 video clips featuring 213 subjects, and it is labeled with audio-visual feature annotations. We provide train-test, duration, and gender protocols to investigate the impact of different factors. We benchmark our dataset on previously proposed deception detection approaches. To further improve performance while fine-tuning fewer parameters, we propose Parameter-Efficient Crossmodal Learning (PECL), where a Uniform Temporal Adapter (UT-Adapter) explores temporal attention in transformer-based architectures, and a crossmodal fusion module, Plug-in Audio-Visual Fusion (PAVF), combines crossmodal information from audio-visual features. Building on the rich fine-grained audio-visual annotations in DOLOS, we also exploit multi-task learning to enhance performance by concurrently predicting deception and audio-visual features. Experimental results demonstrate the quality of the DOLOS dataset and the effectiveness of PECL. The DOLOS dataset and the source code are available at https://github.com/NMS05/Audio-Visual-Deception-Detection-DOLOS-Dataset-and-Parameter-Efficient-Crossmodal-Learning/tree/main.
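
The multi-task objective mentioned in the abstract, which predicts deception together with audio-visual features, can be illustrated with a minimal sketch. This is not the authors' implementation: the weighted-sum form of the loss, the function names `bce` and `multitask_loss`, and the `aux_weight` hyperparameter are all assumptions made for illustration, and the abstract does not say how the two objectives are actually combined.

```python
import math


def bce(prob: float, label: int, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def multitask_loss(deception_prob, deception_label,
                   attr_probs, attr_labels, aux_weight=0.5):
    """Deception loss plus a weighted mean of auxiliary attribute losses.

    attr_probs / attr_labels stand in for per-clip audio-visual annotations
    (hypothetical binary attributes). aux_weight is an assumed hyperparameter,
    not a value taken from the paper.
    """
    if len(attr_probs) != len(attr_labels):
        raise ValueError("attr_probs and attr_labels must have equal length")
    main = bce(deception_prob, deception_label)
    if not attr_probs:
        return main  # no auxiliary annotations: plain deception loss
    aux = sum(bce(p, y) for p, y in zip(attr_probs, attr_labels)) / len(attr_probs)
    return main + aux_weight * aux


if __name__ == "__main__":
    # One clip labeled deceptive (1) with two hypothetical binary attributes.
    print(multitask_loss(0.8, 1, [0.7, 0.2], [1, 0]))
```

Setting `aux_weight` to 0 recovers the single-task deception loss, which makes it easy to compare the two settings under this assumed formulation.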
