Deploying human activity recognition (HAR) in homes remains rare because sensor signals vary widely across houses, people, and time, which in practice requires in-situ data collection and training. Prior approaches use cameras to generate training labels for privacy-preserving sensors (LiDAR, radar, thermal), but this forces the sensors to detect predefined activities that cameras can see yet the sensors themselves cannot reliably distinguish. In this work, we introduce OrganicHAR, an activity discovery framework that inverts this relationship by placing sensor capabilities at the center of activity discovery. Our approach identifies naturally occurring signal patterns using privacy-preserving sensors, invokes Vision Language Models (VLMs) only at these key moments for scene understanding, and discovers discrete activity labels at granularities that the sensors can reliably detect. Our evaluation with 12 participants demonstrates OrganicHAR's effectiveness: it achieves 79% accuracy on coarse activities (4-5 classes) using only basic ambient sensors (radar, LiDAR, thermal arrays), and 73% accuracy on fine-grained activities (8-9 classes) when a wearable IMU and depth and pose sensors are added. OrganicHAR maintains 77% accuracy on average across configurations while discovering 4-8 categories per user (15 across all users) tailored to each environment and its sensor capabilities. By triggering video processing only at key moments identified by local sensors, we reduce queries to the VLM by 90%, enabling practical and privacy-preserving activity recognition in natural settings.
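To make the idea of triggering video processing only at sensor-identified key moments concrete, the following is a minimal sketch, not the method used in OrganicHAR. It assumes a one-dimensional ambient sensor signal and flags a sample as a key moment when it deviates sharply from a short rolling baseline. The function name `key_moments` and the parameters `window`, `threshold`, and `cooldown` are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def key_moments(signal, window=5, threshold=3.0, cooldown=5):
    """Return indices where the signal departs sharply from its recent baseline.

    Illustrative assumption: a simple rolling z-score gate stands in for
    whatever signal-pattern detection a real system would use.
    """
    triggers = []
    last = -cooldown  # allows a trigger at the first eligible index
    for i in range(window, len(signal)):
        recent = signal[i - window:i]
        baseline = mean(recent)
        # Fall back to a tiny spread so a flat baseline does not divide by zero.
        spread = pstdev(recent) or 1e-6
        z = abs(signal[i] - baseline) / spread
        # The cooldown suppresses repeated triggers for the same event.
        if z > threshold and i - last >= cooldown:
            triggers.append(i)
            last = i
    return triggers


def query_fraction(signal, **kwargs):
    """Fraction of samples that would be forwarded to the (hypothetical) VLM."""
    return len(key_moments(signal, **kwargs)) / len(signal)
```

Under these assumptions, only the indices returned by `key_moments` would be paired with a camera frame and sent to a VLM. All other samples stay on the local sensors, which is the source of the reduction in queries that the abstract describes.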