Recent advances in representation learning have demonstrated the importance of multimodal alignment. The Dual Cross-modal Information Disentanglement (DCID) model, which uses a unified codebook, shows promising results in fine-grained representation and cross-modal generalization. However, it treats all channels equally and neglects minor event information, which leads to interference from irrelevant channels and limits performance on fine-grained tasks. In this work, we propose Training-free Optimization of Codebook (TOC), a method that improves model performance by selecting important channels in the unified space without retraining. We also introduce Hierarchical Dual Cross-modal Information Disentanglement (H-DCID), which extends information separation and alignment to two levels to capture more cross-modal details. Experimental results show significant improvements across various downstream tasks: TOC improves DCID by an average of 1.70% on four tasks, H-DCID surpasses DCID by an average of 3.64%, and combining TOC with H-DCID exceeds DCID by 4.43%. These findings highlight the effectiveness of our methods in facilitating robust and nuanced cross-modal learning and open avenues for future enhancements. The source code and pre-trained models are available at https://github.com/haihuangcode/TOC_H-DCID.
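To make the idea of training-free channel selection concrete, the following is a minimal sketch and not the TOC procedure itself. It assumes a frozen codebook stored as a list of code vectors and scores each channel by its variance across codes, which is a hypothetical criterion chosen only for illustration; the abstract does not specify how TOC ranks channels. The function names `channel_scores`, `select_channels`, and `nearest_code`, and the toy values, are likewise assumptions.

```python
from statistics import pvariance


def channel_scores(codebook):
    """Score each channel by its variance across code vectors.

    Assumed criterion for illustration only; not taken from the paper.
    """
    return [pvariance(column) for column in zip(*codebook)]


def select_channels(codebook, keep):
    """Return the indices of the `keep` highest-scoring channels, ascending."""
    scores = channel_scores(codebook)
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return sorted(ranked[:keep])


def nearest_code(feature, codebook, channels):
    """Index of the closest code, using squared distance on `channels` only."""
    def dist(code):
        return sum((feature[c] - code[c]) ** 2 for c in channels)
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


# Toy frozen codebook: 3 codes x 4 channels. Channel 1 barely varies
# across codes, so it carries little information for telling codes apart.
codebook = [
    [0.0, 5.0, 1.0, 0.2],
    [1.0, 5.1, -1.0, 0.9],
    [2.0, 4.9, 0.0, 0.1],
]
feature = [1.2, 0.0, -0.6, 0.5]

channels = select_channels(codebook, keep=2)
code_selected = nearest_code(feature, codebook, channels)
code_all = nearest_code(feature, codebook, range(len(feature)))
print(channels, code_selected, code_all)
```

In this toy case the code chosen with all channels differs from the code chosen with the selected channels, because the low-variance channel dominates the distance. The codebook is never updated, which is the sense in which such a selection step requires no retraining.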