BATON: A Multimodal Benchmark for Bidirectional Automation Transition Observation in Naturalistic Driving

Existing driving automation (DA) systems on production vehicles rely on human drivers to decide when to engage DA, while requiring them to remain continuously attentive and ready to intervene. This design demands substantial situational judgment and imposes significant cognitive load, leading to steep learning curves, suboptimal user experience, and safety risks from both over-reliance and delayed takeover. Predicting when drivers hand over control to DA and when they take it back is therefore critical for designing proactive, context-aware human-machine interfaces (HMI), yet existing datasets rarely capture the multimodal context of these transitions, including road scene, driver state, vehicle dynamics, and route environment. To fill this gap, we introduce BATON, a large-scale naturalistic dataset capturing real-world DA usage across 127 drivers and 136.6 hours of driving. The dataset synchronizes front-view video, in-cabin video, decoded CAN bus signals, radar-based lead-vehicle interaction, and GPS-derived route context, forming a closed-loop multimodal record around each control transition. We define three benchmark tasks (driving action understanding, handover prediction, and takeover prediction) and evaluate baselines spanning sequence models, classical classifiers, and zero-shot vision-language models (VLMs). Results show that visual input alone is insufficient for reliable transition prediction: front-view video captures road context but not driver state, while in-cabin video reflects driver readiness but not the external scene. Incorporating CAN and route-context signals substantially improves performance over video-only settings, indicating strong complementarity across modalities. We further find that takeover events develop more gradually and benefit from longer prediction horizons, whereas handover events depend more on immediate contextual cues, revealing an asymmetry with direct implications for HMI design in assisted driving systems.
