We address the challenge of representing long captions in vision-language models, such as CLIP. By design, these models rely on fixed, absolute positional encodings, restricting inputs to a maximum of 77 tokens and hindering performance on tasks requiring longer descriptions. Although recent work has attempted to overcome this limit, the proposed approaches struggle to model token relationships over longer distances and simply extend to a new, fixed token length. Instead, we propose a generalizable method, named TULIP, which can upgrade the token length of CLIP-like models to any desired length. We do so by improving the architecture with relative position encodings, followed by a training procedure that (i) distills the original CLIP text encoder into an encoder with relative position encodings and (ii) enhances the model for aligning longer captions with images. By effectively encoding captions longer than the default 77 tokens, our model outperforms baselines on cross-modal tasks such as retrieval and text-to-image generation. The code repository is available at https://github.com/ivonajdenkoska/tulip.
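To make the two ingredients concrete, the sketch below shows one plausible instantiation of them: a T5-style relative position bias in a toy text encoder and a cosine-distillation step against a frozen teacher on captions of at most 77 tokens. This is a minimal illustration under our own assumptions, not the TULIP implementation; all module and function names (e.g., `RelativePositionBias`, `distill_step`) are hypothetical.

```python
# Minimal sketch (not the authors' code): distill a frozen CLIP-style text
# encoder with absolute positions into a student that uses relative position
# bias, so the student can later encode captions longer than 77 tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelativePositionBias(nn.Module):
    """T5-style learned bias added to attention logits, indexed by clipped
    relative distance, so it can generalize beyond the training length."""

    def __init__(self, num_heads: int, max_distance: int = 128):
        super().__init__()
        self.max_distance = max_distance
        self.bias = nn.Embedding(2 * max_distance + 1, num_heads)

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len, device=self.bias.weight.device)
        rel = pos[None, :] - pos[:, None]                                   # (L, L)
        rel = rel.clamp(-self.max_distance, self.max_distance) + self.max_distance
        return self.bias(rel).permute(2, 0, 1)                              # (heads, L, L)


class RelPosTextEncoder(nn.Module):
    """Toy transformer text encoder with relative position bias, standing in
    for the upgraded CLIP text tower (illustrative sizes only)."""

    def __init__(self, vocab=49408, dim=512, heads=8, layers=6):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.rel_bias = RelativePositionBias(heads)
        self.heads = heads
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
            for _ in range(layers)
        )
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        x = self.tok(ids)
        bias = self.rel_bias(ids.size(1))                    # (heads, L, L)
        mask = bias.repeat(ids.size(0), 1, 1)                # additive mask, (B*heads, L, L)
        for blk in self.blocks:
            x = blk(x, src_mask=mask)
        return self.proj(x[:, -1])                           # last-token pooling (stand-in for EOT)


def distill_step(student, teacher, ids_77, optimizer):
    """Stage (i): match the frozen teacher's embeddings on <=77-token captions."""
    with torch.no_grad():
        target = teacher(ids_77)                             # frozen original CLIP text encoder
    pred = student(ids_77)
    loss = 1 - F.cosine_similarity(pred, target, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the bias depends only on clipped relative distances, the same student can then be fine-tuned in stage (ii) on captions longer than 77 tokens with a contrastive image-text objective, without re-learning positional parameters.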