Intelligent anomaly detection in dynamic visual environments requires reconciling real-time performance with semantic interpretability. Conventional approaches address only fragments of this challenge. Reconstruction-based models capture low-level deviations without contextual reasoning, object detectors provide speed but limited semantics, and large vision-language systems deliver interpretability at prohibitive computational cost. This work introduces a cascading multi-agent framework that unifies these complementary paradigms into a coherent and interpretable architecture. Early modules perform reconstruction-gated filtering and object-level assessment, while higher-level reasoning agents are selectively invoked to interpret semantically ambiguous events. The system employs adaptive escalation thresholds and a publish-subscribe communication backbone, enabling asynchronous coordination and scalable deployment across heterogeneous hardware. Extensive evaluation on large-scale monitoring data demonstrates that the proposed cascade achieves a threefold reduction in latency compared to direct vision-language inference, while maintaining high perceptual fidelity (PSNR = 38.3 dB, SSIM = 0.965) and consistent semantic labeling. The framework advances beyond conventional detection pipelines by combining early-exit efficiency, adaptive multi-agent reasoning, and explainable anomaly attribution, establishing a reproducible and energy-efficient foundation for scalable intelligent visual monitoring.
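The cascading early-exit behavior described above can be illustrated with a minimal sketch. All names here (`CascadeDetector`, the two gate thresholds, the EMA-style threshold adaptation) are illustrative assumptions, not the paper's actual API: stage 1 applies a reconstruction-error gate and exits early for normal frames, stage 2 handles confident object-level verdicts, and only semantically ambiguous events escalate to a vision-language reasoning agent.

```python
# Hypothetical sketch of the cascade's escalation logic; stage internals
# (reconstruction model, object detector, VLM) are stubbed as scores.
from dataclasses import dataclass


@dataclass
class CascadeDetector:
    recon_threshold: float = 0.10   # stage-1 reconstruction-error gate
    object_threshold: float = 0.50  # stage-2 object-level confidence gate
    adapt_rate: float = 0.05        # EMA rate for the adaptive threshold

    def _adapt(self, error: float) -> None:
        # Adaptive escalation: drift the stage-1 gate toward recent error levels,
        # one plausible reading of "adaptive escalation thresholds".
        self.recon_threshold = ((1 - self.adapt_rate) * self.recon_threshold
                                + self.adapt_rate * error)

    def classify(self, recon_error: float, object_score: float) -> str:
        """Return which cascade stage resolved the frame and its verdict."""
        self._adapt(recon_error)
        if recon_error < self.recon_threshold:
            return "stage1:normal"       # early exit, no further compute spent
        if object_score >= self.object_threshold:
            return "stage2:anomaly"      # object-level assessment suffices
        return "stage3:escalate-to-VLM"  # ambiguous -> invoke reasoning agent


det = CascadeDetector()
print(det.classify(recon_error=0.02, object_score=0.9))  # early exit at stage 1
```

In a deployed system the three return values would correspond to messages on the publish-subscribe backbone, so that the expensive vision-language agent subscribes only to the escalation topic and the cheap stages never block on it.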