Multimodal Ambivalence/Hesitancy Recognition in Videos for Personalized Digital Health Interventions

Manuela González-González, Soufiane Belharbi, Muhammad Osama Zeeshan, Masoumeh Sharafi, Muhammad Haseeb Aslam, Lorenzo Sia, Nicolas Richet, Marco Pedersoli, Alessandro Lameiras Koerich, Simon L Bacon, Eric Granger

From arXiv: 11 pages, 3 figures. arXiv admin note: substantial text overlap with arXiv:2505.19328

Grounded in behavioural science, health interventions target behaviour change by providing a framework that helps patients acquire and maintain healthy habits that improve medical outcomes. In-person interventions are costly and difficult to scale, especially in resource-limited regions. Digital health interventions offer a cost-effective alternative, potentially supporting independent living and self-management. Automating such interventions, especially through machine learning, has gained considerable attention recently. Ambivalence and hesitancy (A/H) are a primary reason individuals delay, avoid, or abandon health interventions. A/H are subtle and conflicting emotions that place a person in a state between positive and negative evaluations of a behaviour, or between acceptance and refusal to engage in it. They manifest as affective inconsistency across modalities or within a single modality, such as language, facial expressions, vocal expressions, and body language. While experts can be trained to recognize A/H, integrating such experts into digital health interventions is costly and less effective. Automatic A/H recognition is therefore critical for the personalization and cost-effectiveness of digital health interventions. Here, we explore the application of deep learning models to A/H recognition in videos, an inherently multimodal task. In particular, this paper covers three learning setups: supervised learning, unsupervised domain adaptation for personalization, and zero-shot inference via large language models (LLMs). Our experiments are conducted on the unique and recently published BAH video dataset for A/H recognition. Our results show limited performance, suggesting that better-adapted multimodal models are required for accurate A/H recognition. Better methods for spatio-temporal modeling and multimodal fusion are necessary to leverage conflicts within and across modalities.
