Drivers' visual attention provides critical cues for anticipating latent hazards and directly shapes decision-making and control maneuvers; lapses in attention can compromise traffic safety. To emulate drivers' perception patterns and advance visual attention prediction for intelligent vehicles, we propose DiffAttn, a diffusion-based framework that formulates the task as a conditional diffusion-denoising process, enabling more accurate modeling of drivers' attention. To capture both local and global scene features, we adopt a Swin Transformer as the encoder and design a decoder that combines a Feature Fusion Pyramid for cross-layer interaction with dense, multi-scale conditional diffusion, which jointly strengthens denoising learning and models fine-grained local and global scene context. Additionally, a large language model (LLM) layer is incorporated to enhance top-down semantic reasoning and improve sensitivity to safety-critical cues. Extensive experiments on four public datasets show that DiffAttn achieves state-of-the-art (SoTA) performance, surpassing most video-based, top-down-feature-driven, and LLM-enhanced baselines. The framework further supports interpretable, driver-centric scene understanding and has the potential to improve in-cabin human-machine interaction, risk perception, and driver state measurement in intelligent vehicles.
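To make the phrase "conditional diffusion-denoising process" concrete, the following is a minimal NumPy sketch of one common way such a process can be set up for an attention map. The abstract does not specify the noise schedule, the prediction target, or the loss, so everything here is an assumption: the linear beta schedule, the mean-squared-error objective on the clean map, and the names `linear_alpha_bar`, `forward_diffuse`, `toy_denoiser`, and `denoising_loss` are all hypothetical. The `toy_denoiser` merely stands in for the learned decoder conditioned on encoder features.

```python
import numpy as np


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal-retention factors for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t ~ N(sqrt(a_t) * x0, (1 - a_t) * I), the standard forward process."""
    noise = rng.standard_normal(x0.shape)
    a_t = alpha_bar[t]
    x_t = np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * noise
    return x_t, noise


def toy_denoiser(x_t, t, cond):
    """Hypothetical placeholder for a learned, condition-aware decoder.

    It ignores the noisy input and returns the conditioning map unchanged,
    which is only useful for exercising the training-step plumbing.
    """
    return cond


def denoising_loss(x0, cond, alpha_bar, rng, denoiser=toy_denoiser):
    """One training-style step: noise the clean map, predict it back, score with MSE."""
    t = int(rng.integers(0, len(alpha_bar)))
    x_t, _ = forward_diffuse(x0, t, alpha_bar, rng)
    x0_pred = denoiser(x_t, t, cond)
    return float(np.mean((x0_pred - x0) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = linear_alpha_bar(1000)
    attention_map = rng.random((32, 32))    # stand-in for a ground-truth attention map
    scene_features = rng.random((32, 32))   # stand-in for encoder-derived conditioning
    print(denoising_loss(attention_map, scene_features, alpha_bar, rng))
```

In a trained model the denoiser would take the noisy map, the timestep, and the scene features together, and inference would start from pure noise and apply the denoiser iteratively. The sketch covers only the forward noising and a single loss evaluation.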