Autonomous vehicles (AVs) rely on multi-modal fusion for safety, but current visual and optical sensors cannot detect the road-induced excitations that are critical to vehicle dynamic control. Inspired by human synesthesia, we propose Synesthesia of Vehicles (SoV), a novel framework that predicts tactile excitations from visual inputs for autonomous vehicles. We develop a cross-modal spatiotemporal alignment method to resolve the temporal and spatial disparities between the two modalities. Furthermore, we propose a visual-tactile synesthetic (VTSyn) generative model based on latent diffusion for unsupervised, high-quality tactile data synthesis. A real-vehicle perception system collected a multi-modal dataset across diverse road and lighting conditions. Extensive experiments show that VTSyn outperforms existing models in temporal, frequency-domain, and classification metrics, enhancing AV safety through proactive tactile perception.