Out-of-Sight Embodied Agents: Multimodal Tracking, Sensor Fusion, and Trajectory Forecasting

Trajectory prediction is a fundamental problem in computer vision, vision-language-action models, world models, and autonomous systems, with broad impact on autonomous driving, robotics, and surveillance. However, most existing methods assume complete and clean observations. They therefore do not adequately handle agents that are out of sight because of limited camera coverage or occlusions, nor noisy sensor signals for which no ground-truth denoised trajectories are available. These challenges raise safety concerns and reduce robustness in real-world deployment. In this extended study, we introduce major improvements to Out-of-Sight Trajectory (OST), a task for predicting noise-free visual trajectories of out-of-sight objects from noisy sensor observations. Building on our prior work, we expand Out-of-Sight Trajectory Prediction (OOSTraj) from pedestrians only to both pedestrians and vehicles, increasing its relevance to autonomous driving, robotics, and surveillance. Our improved Vision-Positioning Denoising Module exploits camera calibration to establish vision-position correspondence, which compensates for the lack of direct visual cues and enables effective unsupervised denoising of sensor signals. Extensive experiments on the Vi-Fi and JRDB datasets show that our method achieves state-of-the-art results for both trajectory denoising and trajectory prediction, with clear gains over prior baselines. We also compare with classical denoising methods, including Kalman filtering, and adapt recent trajectory prediction models to this setting, establishing a stronger benchmark. To the best of our knowledge, this is the first work to use vision-positioning projection to denoise noisy sensor trajectories of out-of-sight agents, opening new directions for future research.
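
To make the idea of a vision-position correspondence concrete, the following is a minimal sketch of how a calibrated pinhole camera maps a position in world coordinates to pixel coordinates. It illustrates only the standard geometric relationship and is not the Vision-Positioning Denoising Module itself. The function name `project_to_image`, the intrinsic matrix `K`, the extrinsics `R` and `t`, and all numeric values are assumptions chosen for illustration.

```python
import numpy as np


def project_to_image(points_world, K, R, t):
    """Project N x 3 world points to N x 2 pixel coordinates (pinhole model).

    K is the 3 x 3 intrinsic matrix; R (3 x 3) and t (3,) map world
    coordinates into the camera frame. All are hypothetical inputs here.
    """
    points_world = np.asarray(points_world, dtype=float)
    # World frame -> camera frame.
    points_cam = points_world @ R.T + t
    if np.any(points_cam[:, 2] <= 0):
        raise ValueError("all points must lie in front of the camera")
    # Camera frame -> homogeneous pixel coordinates.
    uvw = points_cam @ K.T
    # Perspective division by depth.
    return uvw[:, :2] / uvw[:, 2:3]


# Illustrative calibration: 1000 px focal length, principal point (640, 360),
# camera placed at the world origin with no rotation.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.zeros(3)

# Two made-up sensor positions in metres.
positions = np.array([[0.0, 0.0, 5.0],
                      [1.0, 0.5, 10.0]])
pixels = project_to_image(positions, K, R, t)
print(pixels)
```

With this kind of mapping, a trajectory measured by a positioning sensor can be expressed in the same image plane as visual trajectories, even when the agent itself is not visible to the camera. How the projection is used for unsupervised denoising is specific to the method and is not reproduced in this sketch.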
