Drawing on behavioural science, health interventions focus on behaviour change by providing a framework that helps patients acquire and maintain healthy habits that improve medical outcomes. In-person interventions are costly and difficult to scale, especially in resource-limited regions. Digital health interventions offer a cost-effective alternative, potentially supporting independent living and self-management. Automating such interventions, especially through machine learning, has gained considerable attention recently. Ambivalence and hesitancy (A/H) are a primary reason individuals delay, avoid, or abandon health interventions. A/H are subtle and conflicting emotions that place a person in a state between positive and negative evaluations of a behaviour, or between acceptance and refusal to engage in it. They manifest as affective inconsistency across modalities or within a single modality, such as language, facial and vocal expressions, and body language. While experts can be trained to recognize A/H, integrating human experts into digital health interventions is costly and less effective. Automatic A/H recognition is therefore critical for the personalization and cost-effectiveness of digital health interventions. Here, we explore the application of deep learning models to A/H recognition in videos, an inherently multimodal task. In particular, this paper covers three learning setups: supervised learning, unsupervised domain adaptation for personalization, and zero-shot inference via large language models (LLMs). Our experiments are conducted on the unique and recently published BAH video dataset for A/H recognition. Our results show limited performance, suggesting that better-adapted multimodal models are required for accurate A/H recognition. Better methods for spatio-temporal modeling and multimodal fusion are necessary to leverage conflicts within and across modalities.