Deep machine learning models, including Convolutional Neural Networks (CNNs), have been successful in detecting Mild Cognitive Impairment (MCI) from medical images, questionnaires, and videos. This paper proposes a novel Multi-branch Classifier-Video Vision Transformer (MC-ViViT) model to distinguish individuals with MCI from those with normal cognition by analyzing facial features. The data come from I-CONECT, a behavioral intervention trial aimed at improving cognitive function by providing frequent video chats. MC-ViViT extracts spatiotemporal features of videos in one branch and augments the representations with the MC module. The I-CONECT dataset is challenging because it is imbalanced, containing Hard-Easy and Positive-Negative samples, which impedes the performance of MC-ViViT. To address this imbalance, we propose a loss function for Hard-Easy and Positive-Negative Samples (HP Loss) that combines Focal loss and AD-CORRE loss. Our experimental results on the I-CONECT dataset show the great potential of MC-ViViT in predicting MCI, with an accuracy of 90.63% on some of the interview videos.
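As an illustration of the Focal loss component named above, the following is a minimal sketch of the standard binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). It is not the HP Loss itself: the AD-CORRE term and the way the two losses are combined are not described in this abstract and are not reproduced here. The function names and the defaults `gamma=2.0` and `alpha=0.25` are assumptions taken from common focal-loss usage, not from this paper.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for a single sample (illustrative sketch, not the paper's HP Loss).

    p: predicted probability of the positive class, in [0, 1]
    y: ground-truth label, 0 or 1
    gamma: focusing parameter; larger values down-weight easy samples more
    alpha: weight for the positive class (1 - alpha is used for the negative class)
    """
    # Clamp to avoid log(0).
    p = min(max(p, eps), 1.0 - eps)
    # p_t is the probability the model assigns to the true class.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)^gamma shrinks the loss of well-classified (easy) samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def mean_focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    """Average focal loss over a batch of (probability, label) pairs."""
    losses = [binary_focal_loss(p, y, gamma, alpha) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the usual cross-entropy, so the focusing term is what separates hard samples from easy ones. The class weight `alpha` addresses the positive-negative imbalance.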