Multimodal large language models (MLLMs) demonstrate exceptional capabilities in semantic understanding and visual reasoning, yet they still struggle with precise object localization and with deployment on resource-constrained edge-cloud infrastructure. To address these challenges, this paper proposes the AIVD framework, which unifies precise localization with high-quality semantic generation through collaboration between lightweight edge detectors and cloud-based MLLMs. To strengthen the cloud MLLM's robustness to noise in edge-cropped bounding boxes and to scenario variations, we design an efficient fine-tuning strategy with visual-semantic collaborative augmentation, significantly improving classification accuracy and semantic consistency. Furthermore, to sustain high throughput and low latency across heterogeneous edge devices and dynamic network conditions, we propose a heterogeneous resource-aware dynamic scheduling algorithm. Experimental results show that AIVD substantially reduces resource consumption while improving the MLLM's classification performance and semantic generation quality, and that the proposed scheduling strategy achieves higher throughput and lower latency across diverse scenarios.
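To make the edge-cloud collaboration described above concrete, the following is a minimal, purely illustrative Python sketch of the handoff between an on-device detector and a cloud-side MLLM; all class and function names (EdgeDetector, CloudMLLM, process_frame) are hypothetical placeholders and do not reflect the actual AIVD implementation, fine-tuning strategy, or scheduling algorithm.

```python
# Purely illustrative sketch of an edge-cloud handoff; every name here is hypothetical.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box:
    """Bounding box produced by the edge detector, with a confidence score."""
    x1: float
    y1: float
    x2: float
    y2: float
    score: float


class EdgeDetector:
    """Stand-in for a lightweight on-device detector (assumed, not the paper's model)."""

    def detect(self, frame) -> List[Box]:
        # A real implementation would run a compact detection model on the edge device.
        return [Box(10.0, 20.0, 110.0, 220.0, 0.87)]


class CloudMLLM:
    """Stand-in for the cloud-side multimodal LLM endpoint (assumed interface)."""

    def describe(self, crop) -> str:
        # A real implementation would send the cropped region to the fine-tuned MLLM
        # and return its generated semantic description.
        return "a pedestrian crossing the street"


def process_frame(frame, detector: EdgeDetector, mllm: CloudMLLM,
                  conf_thresh: float = 0.5) -> List[Tuple[Box, str]]:
    """Edge detects and filters regions; only confident crops are uploaded to the cloud."""
    results = []
    for box in detector.detect(frame):
        if box.score < conf_thresh:
            continue  # skip low-confidence boxes to save uplink bandwidth
        crop = frame  # placeholder: a real pipeline would crop `frame` to `box`
        results.append((box, mllm.describe(crop)))
    return results


if __name__ == "__main__":
    # Toy usage: the frame is a dummy object since the components are stubs.
    for box, caption in process_frame(frame=None, detector=EdgeDetector(), mllm=CloudMLLM()):
        print(box, "->", caption)
```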