Tabular-image multimodal learning aims to improve predictive modeling by jointly using structured tabular attributes and visual data. Pretrained encoders provide strong modality-specific representations, but full fine-tuning is computationally expensive, and keeping the encoders frozen can limit task-specific adaptation. We propose the Tabular-Image Adapter (TI-Adapter), a fine-tuning framework that uses modality-specific adapters for efficient multimodal adaptation. TI-Adapter freezes the pretrained tabular encoder and learns an adapter applied to the extracted tabular embedding; the image branch is adapted with embedding-level and bottleneck-level adapters rather than fully fine-tuned. Experiments on 20 tabular-image datasets show that TI-Adapter matches or exceeds the predictive performance of full fine-tuning while using substantially fewer trainable parameters. Ablation studies further show that adapter placement is important for balancing predictive performance against practical efficiency.
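To make the adapter idea concrete, the following is a minimal sketch of a residual bottleneck adapter of the kind that could sit on top of a frozen encoder's embedding. It is an illustration under stated assumptions, not the TI-Adapter implementation: the abstract does not specify the adapter architecture, so the down-projection/ReLU/up-projection layout, the zero-initialised up-projection, and the names `BottleneckAdapter`, `num_params`, and `trainable_fraction` are all hypothetical. It uses plain Python lists so that it runs without any deep learning library.

```python
import random


class BottleneckAdapter:
    """Hypothetical residual bottleneck adapter (assumed design, not from the paper).

    Computes x + W_up * relu(W_down * x + b_down) + b_up on one embedding vector.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.bottleneck = bottleneck
        # Down-projection (bottleneck x dim): small random initialisation.
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        self.b_down = [0.0] * bottleneck
        # Up-projection (dim x bottleneck): zeros, so the adapter starts
        # as an identity map over the frozen embedding (an assumption).
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]
        self.b_up = [0.0] * dim

    def num_params(self):
        # Two weight matrices plus two bias vectors.
        return 2 * self.dim * self.bottleneck + self.bottleneck + self.dim

    def __call__(self, x):
        # Project down to the bottleneck and apply ReLU.
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w_down, self.b_down)]
        # Project back up to the embedding dimension.
        delta = [sum(w * h for w, h in zip(row, hidden)) + b
                 for row, b in zip(self.w_up, self.b_up)]
        # Residual connection keeps the frozen embedding intact.
        return [xi + di for xi, di in zip(x, delta)]


def trainable_fraction(frozen_encoder_params, adapters):
    """Share of all parameters that are trainable when only adapters are updated."""
    adapter_params = sum(a.num_params() for a in adapters)
    return adapter_params / (frozen_encoder_params + adapter_params)
```

Under these assumptions the adapter adds `2 * dim * bottleneck + bottleneck + dim` trainable parameters per placement, which is why a small bottleneck keeps the trainable share low relative to a frozen encoder. Where such adapters are placed in the image branch (embedding level versus bottleneck level) is the design question the ablations address, and this sketch does not model it.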