Distracted driving is a major cause of traffic collisions, calling for robust and scalable detection methods. Vision-language models (VLMs) enable strong zero-shot image classification, but existing VLM-based distracted driver detectors often underperform in real-world conditions. We identify subject-specific appearance variations (e.g., clothing, age, and gender) as a key bottleneck: VLMs entangle these factors with behavior cues, leading to decisions driven by who the driver is rather than what the driver is doing. To address this, we propose a subject decoupling framework that extracts a driver appearance embedding and removes its influence from the image embedding prior to zero-shot classification, thereby emphasizing distraction-relevant evidence. We further orthogonalize the text embeddings via metric projection onto the Stiefel manifold, improving class separability while staying close to the original semantics. Experiments demonstrate consistent gains over prior baselines, indicating the promise of our approach for practical road-safety applications.
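The two core operations can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the function names are illustrative, and it assumes (a) the appearance influence is removed by projecting the image embedding onto the orthogonal complement of the appearance embedding, and (b) the metric (Frobenius-norm) projection of the stacked text embeddings onto the Stiefel manifold, i.e. the nearest orthonormal frame, is computed from the polar factor of an SVD.

```python
import numpy as np

def decouple_subject(img_emb: np.ndarray, app_emb: np.ndarray) -> np.ndarray:
    """Remove the appearance component from the image embedding.

    Projects img_emb onto the orthogonal complement of app_emb,
    then renormalizes, so the result carries no energy along the
    subject-appearance direction (illustrative assumption).
    """
    a = app_emb / np.linalg.norm(app_emb)
    residual = img_emb - (img_emb @ a) * a
    return residual / np.linalg.norm(residual)

def orthogonalize_texts(T: np.ndarray) -> np.ndarray:
    """Metric projection of text embeddings onto the Stiefel manifold.

    T has shape (k, d) with k <= d; the polar factor U @ Vt is the
    closest matrix with orthonormal rows in Frobenius norm.
    """
    U, _, Vt = np.linalg.svd(T, full_matrices=False)
    return U @ Vt

def zero_shot_scores(img_emb, app_emb, T):
    """Cosine-style scores of the decoupled image against orthogonalized prompts."""
    z = decouple_subject(img_emb, app_emb)
    return orthogonalize_texts(T) @ z
```

After decoupling, the image embedding is exactly orthogonal to the appearance direction, so class scores cannot be driven by who the driver is; the orthogonalized prompts keep pairwise-orthonormal directions while remaining the nearest such frame to the original text embeddings.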