Fonts convey different impressions to readers, and these impressions often arise from font shapes. However, the correlation between fonts and their impressions is weak and unstable because impressions are subjective. To capture this weak and unstable cross-modal correlation between font shapes and their impressions, we propose Impression-CLIP, a novel machine-learning model based on CLIP (Contrastive Language-Image Pre-training). In the CLIP-based model, font image features are pulled closer to their corresponding impression features and pushed apart from unrelated impression features. This procedure realizes co-embedding of font images and their impressions. In our experiments, we perform cross-modal retrieval between fonts and impressions through this co-embedding. The results indicate that Impression-CLIP achieves better retrieval accuracy than the state-of-the-art method. Additionally, our model shows robustness to noise and missing tags.
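The pull-together/push-apart objective described above can be sketched as a CLIP-style symmetric contrastive (InfoNCE) loss. The following is a minimal NumPy illustration under the assumption of a standard CLIP objective; the function name, `temperature` value, and overall structure are illustrative, not the authors' actual implementation.

```python
import numpy as np

def clip_style_loss(image_feats, impression_feats, temperature=0.07):
    """Symmetric contrastive loss: matched image/impression pairs (the
    diagonal of the similarity matrix) are pulled together; mismatched
    pairs are pushed apart. A sketch of the CLIP objective, not the
    authors' implementation."""
    # L2-normalize so the dot product is cosine similarity
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    imp = impression_feats / np.linalg.norm(impression_feats, axis=1, keepdims=True)
    logits = img @ imp.T / temperature   # pairwise similarity matrix
    labels = np.arange(len(img))         # matched pairs lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-impression and impression-to-image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

With perfectly aligned pairs the loss approaches zero, while mismatched pairings yield a large loss, reflecting the pulling-closer and pushing-apart behavior of the co-embedding.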