Contrastive video-language representation learning approaches, e.g., CLIP, have achieved outstanding performance by pursuing semantic interaction over pre-defined video-text pairs. To go beyond this coarse-grained global interaction, we must tackle the challenging "shell-breaking" interactions required for fine-grained cross-modal learning. In this paper, we model video and text as players in a multivariate cooperative game to handle the uncertainty of fine-grained semantic interaction, which exhibits diverse granularity, flexible combination, and vague intensity. Concretely, we propose Hierarchical Banzhaf Interaction (HBI) to value possible correspondences between video frames and text words for sensitive and explainable cross-modal contrast. To efficiently realize the cooperative game between multiple video frames and multiple text words, the proposed method clusters the original video frames (text words) and computes the Banzhaf Interaction between the merged tokens. By stacking token merge modules, we achieve cooperative games at different semantic levels. Extensive experiments on commonly used text-video retrieval and video question answering benchmarks show superior performance and justify the efficacy of HBI. More encouragingly, HBI can also serve as a visualization tool to promote the understanding of cross-modal interaction, which may have a far-reaching impact on the community. The project page is available at https://jpthu17.github.io/HBI/.
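As a rough illustration of the quantity being computed, the sketch below evaluates the textbook pairwise Banzhaf interaction index, i.e., the average over all coalitions S of the remaining players of v(S ∪ {i, j}) − v(S ∪ {i}) − v(S ∪ {j}) + v(S). This is the standard game-theoretic definition and not necessarily the exact formulation or approximation used in HBI. The player names and the `toy_value` function are hypothetical stand-ins for merged frame and word tokens and for a learned video-text similarity.

```python
from itertools import combinations


def banzhaf_interaction(players, i, j, value):
    """Textbook pairwise Banzhaf interaction index of players i and j.

    `value` maps a frozenset of players (a coalition) to a real number.
    The cost is exponential in the number of players, which is one reason
    to merge tokens before playing the game.
    """
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Extra value gained when i and j join S together,
            # beyond what each of them contributes alone.
            total += (value(s | {i, j}) - value(s | {i})
                      - value(s | {j}) + value(s))
    return total / 2 ** len(others)


# Hypothetical toy game: only "frame_0" and "word_0" correspond, so a
# coalition is worth 1.0 exactly when it contains both of them.
PLAYERS = ["frame_0", "frame_1", "word_0"]


def toy_value(coalition):
    return 1.0 if {"frame_0", "word_0"} <= coalition else 0.0


matched = banzhaf_interaction(PLAYERS, "frame_0", "word_0", toy_value)
unmatched = banzhaf_interaction(PLAYERS, "frame_1", "word_0", toy_value)
print(matched, unmatched)
```

In this toy game the matching frame-word pair receives a strictly larger interaction than the non-matching pair, which is the kind of signal a fine-grained contrastive objective could exploit.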