We present MST-MIXER - a novel video dialog model operating over a generic multi-modal state tracking scheme. Current models that claim to perform multi-modal state tracking fall short in two major respects: they either (1) track only one modality (mostly the visual input) or (2) target synthetic datasets that do not reflect the complexity of real-world, in-the-wild scenarios. Our model addresses both limitations in an attempt to close this crucial research gap. Specifically, MST-MIXER first tracks the most important constituents of each input modality. Then, it predicts the missing underlying structure of the selected constituents of each modality by learning local latent graphs using a novel multi-modal graph structure learning method. Subsequently, the learned local graphs and features are parsed together to form a global graph operating on the mix of all modalities, which further refines its structure and node embeddings. Finally, the fine-grained graph node features are used to enhance the hidden states of the backbone Vision-Language Model (VLM). MST-MIXER achieves new state-of-the-art results on five challenging benchmarks.
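The pipeline described above can be sketched end-to-end. This is a minimal illustrative sketch, not the paper's implementation: the importance scores, the cosine-similarity top-k adjacency standing in for the learned graph-structure module, and all dimensions and helper names are assumptions.

```python
import numpy as np

def select_constituents(feats, scores, k):
    """Keep the k highest-scoring token features of one modality
    (hypothetical stand-in for the paper's constituent tracking)."""
    idx = np.argsort(scores)[-k:]
    return feats[idx]

def learn_latent_graph(nodes, k_neighbors):
    """Infer a sparse adjacency from pairwise cosine similarity - an
    assumed proxy for the learned multi-modal graph structure."""
    normed = nodes / np.linalg.norm(nodes, axis=1, keepdims=True)
    sim = normed @ normed.T
    adj = np.zeros_like(sim)
    for i in range(len(sim)):
        adj[i, np.argsort(sim[i])[-k_neighbors:]] = 1.0
    return np.maximum(adj, adj.T)  # symmetrise

def message_pass(nodes, adj, weight):
    """One round of degree-normalised graph message passing with ReLU."""
    deg = adj.sum(axis=1, keepdims=True)  # >= 1: top-k includes self
    return np.maximum((adj / deg) @ nodes @ weight, 0.0)

rng = np.random.default_rng(0)
d = 16  # assumed feature dimension
modalities = {name: rng.normal(size=(n, d))
              for name, n in [("video", 20), ("audio", 12), ("text", 30)]}

# Stage 1-2: per modality, select constituents and refine over a local graph.
local_nodes = []
for name, feats in modalities.items():
    scores = rng.normal(size=len(feats))  # placeholder importance scores
    sel = select_constituents(feats, scores, k=8)
    adj = learn_latent_graph(sel, k_neighbors=3)
    w = rng.normal(size=(d, d)) * 0.1
    local_nodes.append(message_pass(sel, adj, w))

# Stage 3: mix all refined local nodes into one global multi-modal graph.
global_nodes = np.concatenate(local_nodes, axis=0)
global_adj = learn_latent_graph(global_nodes, k_neighbors=4)
w_g = rng.normal(size=(d, d)) * 0.1
refined = message_pass(global_nodes, global_adj, w_g)
print(refined.shape)  # (24, 16): 3 modalities x 8 nodes, d = 16
```

In the full model these refined node features would be injected into the VLM's hidden states; here the sketch stops at the global graph output to keep the example self-contained.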