We present MST-MIXER, a novel video dialog model operating over a generic multi-modal state tracking scheme. Current models that claim to perform multi-modal state tracking fall short in two major respects: (1) they either track only one modality (mostly the visual input), or (2) they target synthetic datasets that do not reflect the complexity of real-world, in-the-wild scenarios. Our model addresses both limitations in an attempt to close this crucial research gap. Specifically, MST-MIXER first tracks the most important constituents of each input modality. It then predicts the missing underlying structure of the selected constituents of each modality by learning local latent graphs with a novel multi-modal graph structure learning method. Subsequently, the learned local graphs and features are merged into a global graph over the mix of all modalities, whose structure and node embeddings are refined further. Finally, the fine-grained graph node features are used to enhance the hidden states of the backbone Vision-Language Model (VLM). MST-MIXER achieves new state-of-the-art results on five challenging benchmarks.
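To make the pipeline concrete, below is a minimal PyTorch sketch of the four stages described above: constituent tracking, local latent graph learning per modality, global graph mixing, and VLM hidden-state enhancement. All names (`MSTMixerSketch`, `LatentGraphLearner`, `top_k`, etc.) and design choices (soft attention-based adjacencies, a single linear message-passing step) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the multi-modal state tracking pipeline described
# above. Module names and architectural details are assumptions for
# illustration only; see the paper for the actual MST-MIXER design.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentGraphLearner(nn.Module):
    """Learns a soft adjacency matrix over a set of node features."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (batch, num_nodes, dim) -> adjacency: (batch, num_nodes, num_nodes)
        scores = self.query(nodes) @ self.key(nodes).transpose(-1, -2)
        return F.softmax(scores / nodes.size(-1) ** 0.5, dim=-1)


class MSTMixerSketch(nn.Module):
    def __init__(self, dim: int = 256, top_k: int = 8):
        super().__init__()
        self.top_k = top_k
        self.scorer = nn.Linear(dim, 1)           # importance of each constituent
        self.local_graph = LatentGraphLearner(dim)
        self.global_graph = LatentGraphLearner(dim)
        self.gnn = nn.Linear(dim, dim)            # one message-passing step
        self.out = nn.Linear(dim, dim)

    def track(self, feats: torch.Tensor) -> torch.Tensor:
        # Step 1: keep the top-k most important constituents of one modality.
        scores = self.scorer(feats).squeeze(-1)           # (batch, n)
        idx = scores.topk(self.top_k, dim=-1).indices     # (batch, k)
        return feats.gather(1, idx.unsqueeze(-1).expand(-1, -1, feats.size(-1)))

    def forward(self, modalities: list[torch.Tensor], hidden: torch.Tensor) -> torch.Tensor:
        # Step 2: per modality, learn a local latent graph over the tracked
        # constituents and refine their features by message passing.
        local_nodes = []
        for feats in modalities:
            nodes = self.track(feats)
            adj = self.local_graph(nodes)
            local_nodes.append(F.relu(self.gnn(adj @ nodes)))

        # Step 3: mix all modalities into one global graph and refine again.
        mixed = torch.cat(local_nodes, dim=1)
        global_adj = self.global_graph(mixed)
        refined = F.relu(self.gnn(global_adj @ mixed))

        # Step 4: pool the graph nodes and add them to the VLM hidden states.
        return hidden + self.out(refined.mean(dim=1, keepdim=True))


if __name__ == "__main__":
    video = torch.randn(2, 32, 256)   # e.g. frame features
    text = torch.randn(2, 20, 256)    # e.g. dialog-history token features
    hidden = torch.randn(2, 20, 256)  # VLM hidden states to be enhanced
    model = MSTMixerSketch()
    print(model([video, text], hidden).shape)  # torch.Size([2, 20, 256])
```

The sketch uses fully dense soft adjacencies for brevity; a practical graph structure learner would typically sparsify them (e.g., by keeping only the strongest edges per node) before message passing.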