Models for image representation learning are typically designed for either recognition or generation. Various forms of contrastive learning help models convert images into embeddings that are useful for classification, detection, and segmentation. On the other hand, models can be trained to reconstruct images with pixel-wise, perceptual, and adversarial losses in order to learn a latent space that is useful for image generation. We seek to unify these two directions with a first-of-its-kind model that learns representations which are simultaneously useful for recognition and generation. We train our model as a hyper-network for implicit neural representations (INRs), which learns to map each image to the weights of a small network for fast, accurate reconstruction. We further integrate our INR hyper-network with knowledge distillation to improve its generalization and performance. Beyond the novel training design, the model also learns an exceptionally compact embedding space with strong performance across visual tasks. The complete model is competitive with state-of-the-art results for image representation learning, while also enabling generative capabilities through its high-quality tiny embeddings. The code is available at https://github.com/tiktok/huvr.
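The core mechanism described above, a hyper-network that maps an image embedding to the weights of a coordinate-based INR, can be sketched minimally as follows. All sizes, the linear hyper-network, and the two-layer MLP are illustrative assumptions for exposition, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): an EMB-dim image embedding is
# mapped by the hyper-network to the weights of a tiny coordinate MLP.
EMB = 16             # size of the "tiny embedding"
HID = 8              # hidden width of the INR MLP
W_IN, W_OUT = 2, 3   # INR maps (x, y) pixel coordinates -> RGB

# Total parameter count of the INR: two weight matrices plus biases.
N_PARAMS = (W_IN * HID + HID) + (HID * W_OUT + W_OUT)

# Hyper-network stand-in: a single linear map from embedding to INR weights.
H = rng.normal(scale=0.1, size=(EMB, N_PARAMS))

def inr_forward(params: np.ndarray, coords: np.ndarray) -> np.ndarray:
    """Run the tiny INR whose flat weight vector came from the hyper-network."""
    i = 0
    w1 = params[i:i + W_IN * HID].reshape(W_IN, HID); i += W_IN * HID
    b1 = params[i:i + HID]; i += HID
    w2 = params[i:i + HID * W_OUT].reshape(HID, W_OUT); i += HID * W_OUT
    b2 = params[i:i + W_OUT]
    h = np.maximum(coords @ w1 + b1, 0.0)  # ReLU hidden layer
    return h @ w2 + b2                     # predicted RGB per coordinate

# One image -> one embedding -> one set of INR weights.
embedding = rng.normal(size=EMB)   # would come from the image encoder
params = embedding @ H

# Query the INR on a 4x4 grid of normalized pixel coordinates.
xs, ys = np.meshgrid(np.linspace(0, 1, 4), np.linspace(0, 1, 4))
coords = np.stack([xs.ravel(), ys.ravel()], axis=1)
rgb = inr_forward(params, coords)
print(rgb.shape)  # (16, 3): one RGB prediction per queried pixel
```

In training, the random linear map `H` would be replaced by a learned hyper-network, and the predicted RGB values would be compared against the source image with the reconstruction losses mentioned above; because the INR is queried per coordinate, the image can be decoded at arbitrary resolution from the same tiny embedding.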