Autonomous driving (AD) systems are becoming increasingly capable of handling complex tasks, largely due to recent advances in deep learning and AI. As interactions between autonomous systems and humans grow, the interpretability of a driving system's decision-making process becomes increasingly crucial for ensuring safe operation. Successful human-machine interaction requires understanding the system's underlying representations of the environment and the driving task, which remains a significant challenge for deep learning-based systems. To address this, we introduce the task of driver intent prediction (DIP): interpretable anticipation of a driver's maneuver before it occurs, which plays a critical role in the safety of AD systems. To foster research in interpretable DIP, we curate the eXplainable Driving Action Anticipation Dataset (DAAD-X), a new multimodal, ego-centric video dataset that provides hierarchical, high-level textual explanations as causal reasoning for the driver's decisions. These explanations are derived from both the driver's eye gaze and the ego-vehicle's perspective. Next, we propose the Video Concept Bottleneck Model (VCBM), a framework that inherently generates spatio-temporally coherent explanations, without relying on post-hoc techniques. Finally, through extensive evaluation of the proposed VCBM on the DAAD-X dataset, we demonstrate that transformer-based models exhibit greater interpretability than conventional CNN-based models. Additionally, we introduce a multilabel t-SNE visualization technique to illustrate the disentanglement and causal correlation among multiple explanations. Our data, code, and models are available at: https://mukil07.github.io/VCBM.github.io/