Low-altitude vision systems are becoming a critical infrastructure for smart city governance. However, existing object-centric perception paradigms and loosely coupled vision-language pipelines are still difficult to support management-oriented anomaly understanding required in real-world urban governance. To bridge this gap, we introduce GovLA-10K, the first management-oriented multi-modal benchmark for low-altitude intelligence, along with GovLA-Reasoner, a unified vision-language reasoning framework tailored for governance-aware aerial perception. Unlike existing studies that aim to exhaustively annotate all visible objects, GovLA-10K is deliberately designed around functionally salient targets that directly correspond to practical management needs, and further provides actionable management suggestions grounded in these observations. To effectively coordinate the fine-grained visual grounding with high-level contextual language reasoning, GovLA-Reasoner introduces an efficient feature adapter that implicitly coordinates discriminative representation sharing between the visual detector and the large language model (LLM). Extensive experiments show that our method significantly improves performance while avoiding the need of fine-tuning for any task-specific individual components. We believe our work offers a new perspective and foundation for future studies on management-aware low-altitude vision-language systems.
翻译:低空视觉系统正成为智慧城市治理的关键基础设施。然而,现有的以物体为中心的感知范式与松耦合的视觉-语言流程仍难以支撑现实城市治理所需的、面向管理的异常理解。为弥合这一差距,我们提出了GovLA-10K——首个面向管理的低空智能多模态基准,以及GovLA-Reasoner——一个为治理感知的航空感知量身定制的统一视觉-语言推理框架。与旨在详尽标注所有可见物体的现有研究不同,GovLA-10K是围绕功能显著目标而精心设计的,这些目标直接对应实际管理需求,并进一步基于这些观测提供可操作的管理建议。为了有效协调细粒度视觉定位与高层上下文语言推理,GovLA-Reasoner引入了一个高效的特征适配器,该适配器隐式地协调视觉检测器与大语言模型(LLM)之间的判别性表征共享。大量实验表明,我们的方法显著提升了性能,同时避免了为任何特定任务组件进行微调的需要。我们相信,我们的工作为未来研究治理感知的低空视觉-语言系统提供了新的视角和基础。