Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision, whose goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. To this end, we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics. IMU-to-4D takes data from a few inertial sensors in earbuds, watches, or smartphones and predicts detailed 4D human motion together with coarse scene structure. Experiments across diverse human-scene datasets show that IMU-to-4D yields more coherent and temporally stable results than state-of-the-art cascaded pipelines, suggesting that wearable motion sensors alone can support rich 4D understanding.