Understanding long text is in great demand in practice yet beyond the reach of most language-image pre-training (LIP) models. In this work, we empirically confirm that the key reason for this issue is that training images are usually paired with short captions, leaving certain tokens easily overshadowed by salient ones. To address this problem, our initial attempt is to relabel the data with long captions; however, directly learning from them may degrade performance on short-text understanding (e.g., in the image classification task). Then, by incorporating corner tokens to aggregate diverse textual information, we help the model catch up to its original level of short-text understanding while greatly enhancing its capability for long-text understanding. We further investigate whether the model can continuously benefit from longer captions and observe a clear trade-off between performance and efficiency. Finally, we validate the effectiveness of our approach using a self-constructed large-scale dataset consisting of 100M long-caption-oriented text-image pairs. Notably, on the task of long-text image retrieval, we outperform the competitor that uses long captions by 11.1% (i.e., from 72.62% to 83.72%). We will release the code, the model, and the new dataset to facilitate reproducibility and further research. The project page is available at https://wuw2019.github.io/lotlip.
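The corner-token idea above can be illustrated with a minimal sketch: a few extra learnable tokens are concatenated with the caption tokens before a self-attention pass, so each corner token can attend to, and aggregate, a different part of a long caption. This is only an assumption-laden toy (single-head attention, random weights, NumPy instead of a real training framework; the function and variable names are hypothetical), not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encode_with_corner_tokens(token_emb, corner_emb, Wq, Wk, Wv):
    """One single-head self-attention pass over [corner tokens; caption tokens].

    token_emb : (T, D) caption token embeddings
    corner_emb: (K, D) learnable corner-token embeddings (hypothetical)
    Returns the K corner-token outputs; each attends over the whole
    sequence, so different corners can summarize different textual cues.
    """
    x = np.concatenate([corner_emb, token_emb], axis=0)        # (K+T, D)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))             # (K+T, K+T)
    out = attn @ v                                             # (K+T, D)
    n_corner = corner_emb.shape[0]
    return out[:n_corner]                                      # (K, D)

rng = np.random.default_rng(0)
D, T, K = 16, 12, 4
feats = encode_with_corner_tokens(
    rng.normal(size=(T, D)),   # caption tokens
    rng.normal(size=(K, D)),   # corner tokens
    rng.normal(size=(D, D)), rng.normal(size=(D, D)), rng.normal(size=(D, D)),
)
print(feats.shape)  # (4, 16): one aggregated feature per corner token
```

In a full model these K features would be trained jointly with the image encoder under a contrastive objective, letting short-text queries match a subset of corners while long captions exercise all of them.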