BMCR: Adaptive Backbone Module Composition via Reinforcement Learning for Remote Sensing Object Detection

In remote sensing object detection, Convolutional Neural Networks (CNNs) excel at capturing local details while Vision Transformers (ViTs) are better at global context modeling. However, existing detectors typically rely on a single fixed backbone or a manually designed hybrid architecture, and thus fail to adaptively exploit these complementary strengths across inputs of diverse complexity. To address this limitation, we propose Backbone Module Composition via Reinforcement Learning (BMCR). BMCR dynamically assembles input-adaptive inference paths from reusable modules decomposed from off-the-shelf CNN and ViT backbones. To enable such cross-family composition, we first construct an extensible module toolbox. Specifically, we decompose representative CNN and ViT backbones into reusable functional modules and encapsulate each module with explicit structural, semantic, and computational metadata for compatibility-aware assembly. To bridge the gap between grid-based CNN features and token-based ViT representations, we design a lightweight Optimal Transport (OT)-based transition interface that ensures distribution-aware alignment while respecting spatial consistency. The backbone composition process is then formulated as a sequential decision problem, in which a policy network progressively selects task-relevant modules according to intermediate multi-scale observations. To stabilize the joint optimization of reusable modules and the routing policy, we further develop an Adaptive Module Cooperative Optimization (AMCO) strategy that coordinates module updating, routing exploration, and reward assignment during training. On DOTA-v1.0, DOTA-v1.5, and DIOR-R, BMCR achieves 79.31%, 73.41%, and 71.86% mAP, respectively, surpassing strong static and dynamic baselines by up to 2.5 points while maintaining competitive efficiency.
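The abstract does not specify how the OT-based transition interface is formulated, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes entropic-regularized OT solved with Sinkhorn iterations, uniform mass on both sides, and a cost that adds a feature distance (the distribution-aware part) to a weighted positional distance (the spatial-consistency part). The names `token_protos`, `token_pos`, `lam`, and `eps` are hypothetical and introduced purely for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(x: Vector, y: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def build_cost(grid_feats, grid_pos, token_protos, token_pos, lam: float) -> List[List[float]]:
    """Cost = feature distance + lam * positional distance (assumed form)."""
    return [
        [sq_dist(f, g) + lam * sq_dist(p, q) for g, q in zip(token_protos, token_pos)]
        for f, p in zip(grid_feats, grid_pos)
    ]


def sinkhorn_plan(cost: List[List[float]], eps: float = 0.5, n_iters: int = 200) -> List[List[float]]:
    """Entropic OT plan between uniform marginals via Sinkhorn scaling."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass on CNN grid cells
    b = [1.0 / m] * m  # uniform mass on ViT token slots
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_transition(grid_feats, grid_pos, token_protos, token_pos,
                  lam: float = 0.5, eps: float = 0.5) -> Tuple[List[List[float]], List[List[float]]]:
    """Map grid features to tokens by barycentric projection of the OT plan."""
    cost = build_cost(grid_feats, grid_pos, token_protos, token_pos, lam)
    plan = sinkhorn_plan(cost, eps=eps)
    n, m, dim = len(grid_feats), len(token_protos), len(grid_feats[0])
    tokens = []
    for j in range(m):
        mass = sum(plan[i][j] for i in range(n))
        tokens.append(
            [sum(plan[i][j] * grid_feats[i][d] for i in range(n)) / mass for d in range(dim)]
        )
    return plan, tokens


# Toy input: a 2x2 CNN grid with 2-D features, mapped to two token slots.
grid_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
grid_pos = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
token_protos = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical token prototypes
token_pos = [(0.0, 0.5), (1.0, 0.5)]      # hypothetical token positions

plan, tokens = ot_transition(grid_feats, grid_pos, token_protos, token_pos)
```

In this toy setting each token becomes a weighted average of the grid cells that are close to it in both feature space and position. Raising `lam` makes the assignment follow spatial layout more strictly, while raising `eps` spreads each cell's mass over more tokens.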
