In computer vision tasks, features often come from diverse representations, domains (e.g., indoor and outdoor), and modalities (e.g., text, images, and videos). Effectively fusing these features is essential for robust performance, especially given the availability of powerful pre-trained models such as vision-language models. However, common fusion methods, such as concatenation, element-wise operations, and non-linear techniques, often fail to capture structural relationships and deep feature interactions, and suffer from inefficiency or misalignment of features across domains or modalities. In this paper, we shift from the high-dimensional feature space to a lower-dimensional, interpretable graph space by constructing relationship graphs that encode feature relationships at different levels (e.g., clip, frame, patch, and token). To capture deeper interactions, we expand the graphs through iterative graph relationship updates and introduce a learnable graph fusion operator that integrates the expanded relationships for more effective fusion. Our approach is relationship-centric, operates in a homogeneous space, and is mathematically principled, resembling element-wise relationship score aggregation via multilinear polynomials. We demonstrate the effectiveness of our graph-based fusion method on video anomaly detection, showing strong performance across multi-representational, multi-modal, and multi-domain feature fusion tasks.
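As a rough illustration of the pipeline the abstract describes, the hypothetical PyTorch sketch below builds a cosine-similarity relationship graph per feature source, expands each graph through iterated relationship updates, and fuses the expansions with a learnable operator whose form is a multilinear polynomial in the pairwise relationship scores. All names (relationship_graph, expand_graph, GraphFusion) and design choices (cosine similarity, row-normalized matrix powers, the particular polynomial terms) are our own assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def relationship_graph(feats: torch.Tensor) -> torch.Tensor:
    # N x d features -> N x N relationship graph of cosine similarities.
    feats = F.normalize(feats, dim=-1)
    return feats @ feats.T


def expand_graph(graph: torch.Tensor, steps: int) -> list:
    # Iterative graph relationship updates: repeated row-normalized
    # propagation exposes multi-hop, higher-order relationships.
    expansions, g = [graph], graph
    for _ in range(steps - 1):
        g = F.normalize(g @ graph, dim=-1)
        expansions.append(g)
    return expansions


class GraphFusion(nn.Module):
    # Learnable fusion operator over expanded relationship graphs: a weighted
    # sum of per-source terms plus element-wise products across sources,
    # i.e., a multilinear polynomial in the pairwise relationship scores.
    def __init__(self, num_sources: int, steps: int):
        super().__init__()
        self.steps = steps
        self.linear_w = nn.Parameter(
            torch.full((num_sources, steps), 1.0 / (num_sources * steps)))
        self.product_w = nn.Parameter(torch.full((steps,), 1.0 / steps))

    def forward(self, graphs):
        # graphs: one N x N relationship graph per feature source.
        expanded = [expand_graph(g, self.steps) for g in graphs]
        fused = torch.zeros_like(graphs[0])
        for s, exps in enumerate(expanded):  # per-source linear terms
            for k, g_k in enumerate(exps):
                fused = fused + self.linear_w[s, k] * g_k
        for k in range(self.steps):  # cross-source product terms
            prod = torch.ones_like(graphs[0])
            for exps in expanded:
                prod = prod * exps[k]
            fused = fused + self.product_w[k] * prod
        return fused


# Example: fuse frame-level relationship graphs from two modalities whose
# feature dimensions differ (shapes are illustrative only).
video_feats = torch.randn(16, 512)  # e.g., 16 frames of visual features
text_feats = torch.randn(16, 768)   # e.g., 16 aligned caption embeddings
fusion = GraphFusion(num_sources=2, steps=3)
fused = fusion([relationship_graph(video_feats), relationship_graph(text_feats)])
print(fused.shape)  # torch.Size([16, 16])
```

Because fusion happens on N x N relationship graphs rather than on raw features, sources with different feature dimensions (512 vs. 768 above) meet in the same homogeneous space, which is the alignment argument the abstract makes.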