Multi-modality scene perception tasks, e.g., image fusion and scene understanding, have recently attracted widespread attention for intelligent vision systems. However, earlier efforts typically boost a single task in isolation while neglecting the others, and seldom investigate the underlying connections between tasks for joint improvement. To overcome these limitations, we establish a hierarchical dual-task-driven deep model to bridge these tasks. Concretely, we first construct an image fusion module to fuse complementary characteristics and then cascade two task-related modules: a discriminator for visual effects and a semantic network for feature measurement. We provide a bi-level perspective to formulate image fusion and the follow-up downstream tasks. To incorporate the distinct task-related responses into image fusion, we treat image fusion as the primary goal and the dual modules as learnable constraints. Furthermore, we develop an efficient first-order approximation to compute the corresponding gradients and present dynamic weighted aggregation to balance these gradients for fusion learning. Extensive experiments demonstrate the superiority of our method, which not only produces visually pleasing fused results but also achieves significant improvements in detection and segmentation over state-of-the-art approaches.
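The idea of balancing gradients from the two task-related modules through dynamic weighted aggregation can be illustrated with a small sketch. The abstract does not specify the weighting rule, so the code below substitutes a softmax over per-task loss ratios, loosely in the style of dynamic weight averaging; the function names `dynamic_weights` and `aggregate_gradients`, the `temperature` parameter, and the toy numbers are all assumptions made for illustration, not the authors' method.

```python
import math


def dynamic_weights(prev_losses, curr_losses, temperature=1.0):
    """Hypothetical weighting rule: softmax over loss ratios (current / previous).

    A task whose loss is decreasing more slowly has a larger ratio and
    therefore receives a larger weight. The weights sum to 1.
    """
    ratios = [c / p for c, p in zip(curr_losses, prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_gradients(grads, weights):
    """Weighted sum of per-task gradient vectors of equal length."""
    dim = len(grads[0])
    return [sum(w * g[i] for w, g in zip(weights, grads)) for i in range(dim)]


# Toy example: gradients w.r.t. the fusion parameters coming from the
# visual (discriminator) branch and the semantic branch (assumed values).
grad_visual = [1.0, 0.0]
grad_semantic = [0.0, 1.0]

# The visual loss fell from 1.0 to 0.9, the semantic loss from 1.0 to 0.5,
# so the slower-improving visual task gets the larger weight.
weights = dynamic_weights(prev_losses=[1.0, 1.0], curr_losses=[0.9, 0.5])
fused_grad = aggregate_gradients([grad_visual, grad_semantic], weights)
```

In a real training loop the aggregated gradient would drive the update of the fusion module, with the per-task gradients obtained from whatever approximation scheme is used for the bi-level problem.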