Self-driving cars increasingly rely on deep neural networks to achieve human-like driving. The opacity of such black-box planners makes it difficult to anticipate accurately when they will fail, with potentially catastrophic consequences. While research into interpreting these systems has surged, most of it remains confined to simulations or toy setups because real-world deployment is difficult, leaving the practical utility of such techniques unknown. Here, we introduce the Concept-Wrapper Network (CW-Net), a method that faithfully explains the behavior of machine-learning-based planners by causally grounding their reasoning in human-interpretable concepts, without sacrificing performance. We deploy CW-Net on a real self-driving car and show that the resulting explanations improve the human driver's mental model of the vehicle, allowing them to better predict its behavior, particularly in surprising situations. This demonstrates that explainable deep learning integrated into self-driving cars can be both understandable and useful in a realistic deployment setting. We anticipate that our method could be applied to other safety-critical systems, such as autonomous drones and robotic surgeons, as well as to other architectures, such as end-to-end learning systems and vision-language-action models. Overall, our study establishes a deployment-validated pathway to interpretability for autonomous agents, which could help make them more transparent and safer.