Encoded representations from a pretrained deep learning model (e.g., BERT text embeddings, penultimate CNN layer activations of an image) convey a rich set of features beneficial for information retrieval. Embeddings for a particular modality of data occupy a high-dimensional space of their own, but that space can be semantically aligned to another by a simple mapping, without training a deep neural net. In this paper, we use a simple mapping, computed from least squares and from the singular value decomposition (SVD) solution to the Procrustes problem, as a means of cross-modal information retrieval. That is, given information in one modality such as text, the mapping helps us locate a semantically equivalent data item in another modality such as image. Using off-the-shelf pretrained deep learning models, we have experimented with these simple cross-modal mappings on text-to-image and image-to-text retrieval tasks. Despite their simplicity, our mappings perform reasonably well, reaching a highest accuracy of 77% on recall@10, which is comparable to methods that require costly neural net training and fine-tuning. We have improved the simple mappings by applying contrastive learning to the pretrained models. Contrastive learning can be thought of as properly biasing the pretrained encoders to enhance the quality of the cross-modal mapping. We have further improved the performance using a multilayer perceptron with gating (gMLP), a simple neural architecture.
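As a rough illustration of the kind of mapping described above, the following minimal sketch fits a least-squares map and an orthogonal Procrustes map between two sets of paired embeddings and scores retrieval with recall@k. It is not the authors' implementation: the function names, the use of NumPy, the cosine-similarity ranking, and the assumption that row i of the source matrix is paired with row i of the target matrix are all illustrative choices. The Procrustes variant additionally assumes both embedding spaces have the same dimension, which may require projecting or padding one of them in practice.

```python
import numpy as np


def least_squares_map(X, Y):
    """Return W minimizing ||X W - Y||_F (unconstrained linear map).

    X: (n, d_src) source embeddings, Y: (n, d_tgt) paired target embeddings.
    """
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W


def procrustes_map(X, Y):
    """Return orthogonal W minimizing ||X W - Y||_F (orthogonal Procrustes).

    Assumes X and Y share the same embedding dimension (an assumption of
    this sketch). With X^T Y = U S V^T, the minimizer is W = U V^T.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


def recall_at_k(mapped_queries, gallery, k=10):
    """Fraction of queries whose paired item (same row index) is in the top k.

    Ranking uses cosine similarity, which is an assumption of this sketch.
    """
    q = mapped_queries / np.linalg.norm(mapped_queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = q @ g.T                                # (n_queries, n_gallery)
    topk = np.argsort(-sims, axis=1)[:, :k]       # indices of k best matches
    targets = np.arange(len(q))[:, None]          # ground-truth index per query
    return float((topk == targets).any(axis=1).mean())


if __name__ == "__main__":
    # Synthetic stand-ins for text and image embeddings (hypothetical data).
    rng = np.random.default_rng(0)
    text = rng.standard_normal((500, 32))
    rotation, _ = np.linalg.qr(rng.standard_normal((32, 32)))
    image = text @ rotation + 0.1 * rng.standard_normal((500, 32))

    train, test = slice(0, 400), slice(400, 500)
    for name, fit in [("least squares", least_squares_map),
                      ("Procrustes", procrustes_map)]:
        W = fit(text[train], image[train])
        score = recall_at_k(text[test] @ W, image[test], k=10)
        print(f"{name}: text-to-image recall@10 = {score:.2f}")
```

For the reverse direction (image-to-text), the same functions can be called with the roles of the two embedding matrices swapped. For an orthogonal map, the transpose of W serves as the reverse mapping.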