Multimodal large language models (MLLMs) have shown strong capability in semantic understanding and visual reasoning, yet running them on continuous video streams in bandwidth-constrained edge-cloud systems incurs prohibitive computation and communication overhead, which hinders low-latency alerting and effective delivery of visual evidence. To address this challenge, we propose DAT, which targets high-quality semantic generation, low-latency event alerting, and effective visual evidence supplementation. To avoid unnecessary deep reasoning, DAT uses a collaborative small-large model cascade: a lightweight edge-side small model acts as a gating module that filters out non-target-event frames and performs object detection, triggering MLLM inference only for suspicious frames. Building on this cascade, we introduce an efficient fine-tuning strategy with visual guidance and semantic prompting, which improves structured event understanding, object detection, and output consistency. To ensure low-latency semantic alerting and effective visual evidence supplementation under bandwidth constraints, we further devise a semantics- and bandwidth-aware multi-stream adaptive transmission optimization method. Experimental results show that DAT achieves 98.83% recognition accuracy and 100% output consistency. Under severe congestion, it reduces weighted semantic alert delay by up to 77.5% and delivers 98.33% of visual evidence within 0.5 s, demonstrating the effectiveness of jointly optimizing cascade inference and elastic transmission.
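The gating idea behind the small-large cascade can be illustrated with a minimal sketch. This is not DAT's implementation: the names `GateResult`, `Alert`, `cascade`, and the `threshold` parameter are hypothetical, and the small and large models are passed in as plain callables, under the assumption that the small model returns a suspicion score together with its detections.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator, List


@dataclass
class GateResult:
    """Output of the edge-side small model (hypothetical structure)."""
    score: float       # assumed suspicion score in [0, 1]
    boxes: List[Any]   # assumed object detections for the frame


@dataclass
class Alert:
    """Semantic alert produced for a suspicious frame (hypothetical structure)."""
    frame_id: int
    description: str
    boxes: List[Any]


def cascade(
    frames: Iterable[Any],
    small_model: Callable[[Any], GateResult],
    large_model: Callable[[Any, List[Any]], str],
    threshold: float = 0.5,  # assumed gating threshold, not taken from the paper
) -> Iterator[Alert]:
    """Run the small model on every frame; call the large model only when gated in."""
    for frame_id, frame in enumerate(frames):
        gate = small_model(frame)
        if gate.score < threshold:
            # Non-target frame: skip the expensive MLLM call entirely.
            continue
        # Suspicious frame: pass the detections along as visual guidance.
        description = large_model(frame, gate.boxes)
        yield Alert(frame_id, description, gate.boxes)
```

In this sketch the cost of the large model scales with the number of frames that pass the gate rather than with the length of the stream, which is the saving the cascade is meant to provide.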