Cross-modal coherence modeling is essential for intelligent systems: it helps them organize and structure information so that they can understand and create content about the physical world as coherently as humans do. Previous work on cross-modal coherence modeling leveraged order information from one modality to assist coherence recovery in the target modality. Despite its effectiveness, this approach depends on labeled coherence information for the associated modality, which is not always available and can be costly to acquire, making cross-modal guidance hard to exploit. To tackle this challenge, this paper explores a way to take advantage of cross-modal guidance without gold coherence labels and proposes the Weak Cross-Modal Guided Ordering (WeGO) model. Specifically, WeGO uses high-confidence predicted pairwise orders in one modality as reference information to guide coherence modeling in the other. An iterative learning paradigm is further designed to jointly optimize coherence modeling in the two modalities, each receiving selected guidance from the other. The iterative cross-modal boosting is also applied at inference time to further improve coherence prediction in each modality. Experimental results on two public datasets demonstrate that the proposed method outperforms existing methods on cross-modal coherence modeling tasks. Ablation studies show that the major technical modules are effective. Code is available at: \url{https://github.com/scvready123/IterWeGO}.
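To make the idea of weak cross-modal guidance concrete, the following is a minimal illustrative sketch and not the released implementation. It assumes that items in the two modalities are aligned by index, that each modality already provides a probability that item i precedes item j, and that guidance is applied by simple blending rather than by training. The function names, the confidence threshold, and the blending weight are all hypothetical.

```python
from typing import Dict, Tuple

Pair = Tuple[int, int]  # (i, j): a pair of item indices, aligned across modalities (assumption)


def select_confident_pairs(probs: Dict[Pair, float], threshold: float = 0.9) -> Dict[Pair, int]:
    """Turn confident pairwise predictions into pseudo-labels.

    probs[(i, j)] is the predicted probability that item i precedes item j.
    Returns 1 for pairs where i confidently precedes j, 0 where j confidently
    precedes i; uncertain pairs are dropped.
    """
    guidance: Dict[Pair, int] = {}
    for pair, p in probs.items():
        if p >= threshold:
            guidance[pair] = 1
        elif p <= 1.0 - threshold:
            guidance[pair] = 0
    return guidance


def apply_guidance(probs: Dict[Pair, float], guidance: Dict[Pair, int],
                   weight: float = 0.5) -> Dict[Pair, float]:
    """Blend target-modality probabilities toward the other modality's pseudo-labels.

    This blending is a stand-in for learned guidance (hypothetical simplification).
    """
    updated = dict(probs)
    for pair, label in guidance.items():
        if pair in updated:
            updated[pair] = (1.0 - weight) * updated[pair] + weight * label
    return updated


def iterative_boost(text_probs: Dict[Pair, float], image_probs: Dict[Pair, float],
                    rounds: int = 2, threshold: float = 0.9, weight: float = 0.5):
    """Alternate guidance between the two modalities for a fixed number of rounds."""
    for _ in range(rounds):
        # Select guidance from both modalities before updating either one,
        # so each round uses the previous round's predictions symmetrically.
        image_guidance = select_confident_pairs(image_probs, threshold)
        text_guidance = select_confident_pairs(text_probs, threshold)
        text_probs = apply_guidance(text_probs, image_guidance, weight)
        image_probs = apply_guidance(image_probs, text_guidance, weight)
    return text_probs, image_probs


if __name__ == "__main__":
    text = {(0, 1): 0.55, (0, 2): 0.95, (1, 2): 0.60}
    image = {(0, 1): 0.97, (0, 2): 0.50, (1, 2): 0.05}
    new_text, new_image = iterative_boost(text, image, rounds=1)
    print(new_text)
    print(new_image)
```

In this toy setting, the uncertain text pairs are pulled toward the confident image predictions and the uncertain image pair is pulled toward the confident text prediction; pairs on which the other modality is itself uncertain are left unchanged.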