With the rapid advancement of drone technology, the volume of drone video data is growing quickly, creating an urgent need for efficient semantic retrieval. We are the first to systematically propose and study the drone video-text retrieval (DVTR) task. Drone videos feature overhead perspectives, strong structural homogeneity, and diverse semantic expressions of target combinations, characteristics that existing cross-modal methods designed for ground-level views struggle to model effectively. Dedicated retrieval mechanisms tailored to drone scenarios are therefore necessary. To address this issue, we propose a novel approach called Multi-Semantic Adaptive Mining (MSAM). MSAM introduces a multi-semantic adaptive learning mechanism that incorporates dynamic changes between frames and extracts rich semantic information from specific scene regions, thereby enhancing deep understanding of and reasoning over drone video content. The method relies on fine-grained interactions between words and drone video frames, integrating an adaptive semantic construction module, a distribution-driven semantic learning term, and a semantic diversity term to deepen the interaction between the text and drone video modalities and improve the robustness of feature representations. To reduce interference from the complex backgrounds in drone videos, we introduce a cross-modal interactive feature fusion pooling mechanism that focuses feature extraction and matching on target regions, minimizing the effects of noise. Extensive experiments on two self-constructed drone video-text datasets show that MSAM outperforms existing methods on the drone video-text retrieval task. The source code and datasets will be made publicly available.