Social interactions are fundamental to well-being, yet automatically detecting them in daily life, particularly with wearables, remains underexplored. Most existing systems are evaluated in controlled settings, focus primarily on in-person interactions, or rely on restrictive assumptions (e.g., requiring multiple speakers within fixed temporal windows), limiting their generalizability to real-world use. We present an on-watch interaction detection system designed to capture diverse interactions in naturalistic settings. A core component is a foreground speech detector trained on a public dataset. Evaluated on over 100,000 labeled foreground-speech and background-sound instances, the detector achieves a balanced accuracy of 85.51%, outperforming prior work by 5.11%. We evaluated the system in a real-world deployment (N=38) with over 900 hours of total smartwatch wear time. The system detected 1,691 interactions, of which 77.28% were confirmed via participant self-report, with durations ranging from under one minute to over one hour. Among correct detections, 81.45% were in-person, 15.70% virtual, and 1.85% hybrid. Leveraging participant-labeled data, we further developed a multimodal model that achieves a balanced accuracy of 90.36% and a sensitivity of 91.17% on 33,698 labeled 15-second windows. These results demonstrate the feasibility of real-world interaction sensing and open the door to adaptive, context-aware systems that respond to users' dynamic social environments.
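The abstract reports balanced accuracy and sensitivity for the binary foreground-speech detection task. As a minimal sketch, assuming the standard definition of balanced accuracy for binary classification (the mean of sensitivity and specificity), the relationship between these metrics can be illustrated as follows; the confusion-matrix counts here are illustrative, not taken from the study:

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of per-class recall over a binary confusion matrix.

    tp/fn: outcomes on positive (e.g., foreground-speech) instances.
    tn/fp: outcomes on negative (e.g., background-sound) instances.
    """
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    return (sensitivity + specificity) / 2

# Illustrative counts (hypothetical): 90% recall on each class
# yields a balanced accuracy of 0.90.
print(balanced_accuracy(tp=90, fn=10, tn=90, fp=10))  # 0.9
```

Under this definition, a sensitivity of 91.17% combined with a slightly lower specificity is consistent with the reported 90.36% balanced accuracy.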