CogRail: Benchmarking VLMs in Cognitive Intrusion Perception for Intelligent Railway Transportation Systems

Accurate and early perception of potential intrusion targets is essential for ensuring the safety of railway transportation systems. However, most existing systems focus narrowly on object classification within fixed visual scopes and apply rule-based heuristics to determine intrusion status, often overlooking targets that pose latent intrusion risks. Anticipating such risks requires the cognition of spatial context and temporal dynamics for the object of interest (OOI), which presents challenges for conventional visual models. To facilitate deep intrusion perception, we introduce a novel benchmark, CogRail, which integrates curated open-source datasets with cognitively driven question-answer annotations to support spatio-temporal reasoning and prediction. Building upon this benchmark, we conduct a systematic evaluation of state-of-the-art visual-language models (VLMs) using multimodal prompts to identify their strengths and limitations in this domain. Furthermore, we fine-tune VLMs for better performance and propose a joint fine-tuning framework that integrates three core tasks, position perception, movement prediction, and threat analysis, facilitating effective adaptation of general-purpose foundation models into specialized models tailored for cognitive intrusion perception. Extensive experiments reveal that current large-scale multimodal models struggle with the complex spatial-temporal reasoning required by the cognitive intrusion perception task, underscoring the limitations of existing foundation models in this safety-critical domain. In contrast, our proposed joint fine-tuning framework significantly enhances model performance by enabling targeted adaptation to domain-specific reasoning demands, highlighting the advantages of structured multi-task learning in improving both accuracy and interpretability. Code will be available at https://github.com/Hub-Tian/CogRail.

翻译：准确且及早地感知潜在的入侵目标对于确保铁路运输系统的安全至关重要。然而，现有系统大多局限于固定视野内的目标分类，并应用基于规则的启发式方法判断入侵状态，常常忽视那些构成潜在入侵风险的目标。预测此类风险需要对感兴趣目标（OOI）的空间上下文和时间动态进行认知，这对传统视觉模型提出了挑战。为促进深度入侵感知，我们引入了一个新颖的基准测试CogRail，它整合了精选的开源数据集与认知驱动的问答标注，以支持时空推理与预测。基于此基准，我们使用多模态提示对前沿的视觉语言模型（VLMs）进行了系统性评估，以识别它们在该领域的优势与局限。此外，我们对VLMs进行了微调以提升性能，并提出了一种联合微调框架，该框架整合了位置感知、运动预测和威胁分析三个核心任务，促进了通用基础模型向认知入侵感知专用模型的有效适配。大量实验表明，当前的大规模多模态模型难以应对认知入侵感知任务所需的复杂时空推理，凸显出现有基础模型在这一安全关键领域的局限性。相比之下，我们提出的联合微调框架通过使模型能够针对领域特定的推理需求进行定向适配，显著提升了模型性能，突显了结构化多任务学习在提升准确性和可解释性方面的优势。代码将在 https://github.com/Hub-Tian/CogRail 提供。