Image classification models often exhibit unstable performance in real-world applications because image content varies with the viewpoint of the subject and with lighting conditions. To mitigate these challenges, existing studies commonly incorporate additional modalities paired with the visual data to regularize learning, enabling the extraction of high-quality visual features from complex image regions. In multimodal learning specifically, cross-modal alignment is recognized as an effective strategy: it harmonizes the modalities by learning a domain-consistent latent feature space for visual and semantic features. However, this approach can be limited by the heterogeneity of multimodal information, such as differences in feature distribution and structure. To address this issue, we introduce the Multimodal Alignment and Reconstruction Network (MARNet), designed to enhance the model's robustness to visual noise. Notably, MARNet includes a cross-modal diffusion reconstruction module that smoothly and stably blends information across domains. Experiments on two benchmark datasets, Vireo-Food172 and Ingredient-101, demonstrate that MARNet improves the quality of the image features the model extracts. MARNet is a plug-and-play framework that can be rapidly integrated into various image classification frameworks to boost their performance.
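As an illustration of the cross-modal alignment strategy described above, here is a minimal sketch of a contrastive objective that projects paired visual and semantic features into a shared latent space. The module name, feature dimensions, temperature, and InfoNCE-style loss are illustrative assumptions for the general technique, not MARNet's actual architecture.

```python
# Minimal sketch of contrastive cross-modal alignment, assuming paired
# image/text embeddings. All names and dimensions are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAlignment(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, latent_dim=512, temperature=0.07):
        super().__init__()
        # Project each modality into a shared latent feature space.
        self.img_proj = nn.Linear(img_dim, latent_dim)
        self.txt_proj = nn.Linear(txt_dim, latent_dim)
        self.temperature = temperature

    def forward(self, img_feat, txt_feat):
        # L2-normalize so the dot product below is cosine similarity.
        z_img = F.normalize(self.img_proj(img_feat), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        logits = z_img @ z_txt.t() / self.temperature
        # Row i of logits should peak at column i (its paired sample).
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE: match images to their paired text and vice versa.
        loss = (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2
        return loss

# Usage with random stand-in features for a batch of 8 image/text pairs.
align = CrossModalAlignment()
loss = align(torch.randn(8, 2048), torch.randn(8, 768))
loss.backward()
```

Aligning both modalities in one latent space in this way is what makes the semantic features usable as a regularizer on the visual branch; the heterogeneity issue the abstract raises arises precisely because the two input distributions differ before this projection.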