While different neural models often exhibit similar latent spaces when exposed to semantically related data, this intrinsic similarity is not always immediately discernible. Towards a better understanding of this phenomenon, our work shows that representations learned by these neural modules can be translated between different pre-trained networks via simpler transformations than previously thought. An advantage of this approach is that these transformations can be estimated using standard, well-understood algebraic procedures with closed-form solutions. Our method directly estimates a transformation between two given latent spaces, thereby enabling effective stitching of encoders and decoders without additional training. We extensively validate the adaptability of this translation procedure in different experimental settings: across various trainings, domains, architectures (e.g., ResNet, CNN, ViT), and in multiple downstream tasks (classification, reconstruction). Notably, we show that it is possible to zero-shot stitch text encoders and vision decoders, or vice versa, yielding surprisingly good classification performance in this multimodal setting.
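
As an illustration of estimating a transformation between two latent spaces in closed form, the following is a minimal sketch and not necessarily the procedure used in this work. It assumes two common choices of algebraic procedure, ordinary least squares and orthogonal Procrustes, and it assumes both spaces have the same dimensionality. The function names, the synthetic "anchor" embeddings, and the noise level are all hypothetical and chosen only for demonstration.

```python
import numpy as np


def fit_linear_map(X, Y):
    """Least-squares solution W minimizing ||X @ W - Y||_F (illustrative choice)."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W


def fit_orthogonal_map(X, Y):
    """Orthogonal Procrustes: R minimizing ||X @ R - Y||_F subject to R.T @ R = I."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


# Synthetic stand-ins for paired embeddings of the same samples in two spaces.
rng = np.random.default_rng(0)
n, d = 200, 16
X = rng.normal(size=(n, d))                    # "space A" embeddings (assumed)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # hidden orthogonal relation (assumed)
Y = X @ Q + 0.01 * rng.normal(size=(n, d))     # "space B" embeddings with small noise

R = fit_orthogonal_map(X, Y)
W = fit_linear_map(X, Y)

# Translate unseen "space A" embeddings into "space B" and measure the error.
X_new = rng.normal(size=(50, d))
target = X_new @ Q
err_orth = np.linalg.norm(X_new @ R - target) / np.linalg.norm(target)
err_lin = np.linalg.norm(X_new @ W - target) / np.linalg.norm(target)
print(f"relative error (orthogonal): {err_orth:.4f}")
print(f"relative error (least squares): {err_lin:.4f}")
```

In a stitching setup, the translated embeddings `X_new @ R` would be passed to a decoder trained on the second space, with no further training of either network. The orthogonal variant constrains the map to rotations and reflections, while the least-squares variant allows any linear map.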