The looking-in-looking-out (LILO) framework has enabled intelligent vehicle applications that understand both the outside scene and the driver's state to improve safety outcomes, with examples in smart airbag deployment, takeover-time prediction during autonomous control transitions, and driver attention monitoring. In this research, we propose an augmentation to this framework, making a case for audio as an additional modality for understanding the driver and, in the evolving autonomy landscape, the passengers and those outside the vehicle as well. We expand LILO by incorporating audio signals, forming the looking-and-listening inside-and-outside (L-LIO) framework, which enhances driver state assessment and environment understanding through multimodal sensor fusion. We evaluate three example cases where audio enhances vehicle safety: supervised learning on driver speech to classify potential impairment states (e.g., intoxication); collection and analysis of passengers' natural-language instructions (e.g., "turn after that red building"), motivating how spoken language can interface with planning systems through audio-aligned instruction data; and scenarios in which vision-only systems fall short and audio can disambiguate the guidance and gestures of external agents. Our datasets include custom-collected in-vehicle and external audio samples recorded in real-world environments. Pilot findings show that audio yields safety-relevant insights, particularly in nuanced or context-rich scenarios where sound is critical to safe decision-making or visual signals alone are insufficient. Remaining challenges include ambient noise interference, privacy considerations, and robustness across human subjects, motivating further work on reliability in dynamic real-world contexts. L-LIO augments driver and scene understanding through multimodal fusion of audio and visual sensing, offering new paths for safety intervention.