Large text-to-image models have achieved astonishing performance in synthesizing diverse, high-quality images guided by text. With detail-oriented conditioning, even finer-grained spatial control can be achieved. However, some generated images still appear unreasonable, even when they contain rich object features and a harmonious style. In this paper, we delve into the underlying causes and find that deep-level logical information, serving as common-sense knowledge, plays a significant role in understanding and processing images. Nonetheless, almost all existing models neglect the importance of logical relations in images, resulting in poor performance in this respect. Following this observation, we propose LogicalDefender, which combines images with the logical knowledge that humans have already summarized in text. This encourages models to learn logical knowledge faster and better, and, at the same time, extracts widely applicable logical knowledge from both images and human knowledge. Experiments show that our model achieves better logical performance, and that the extracted logical knowledge can be effectively applied to other scenarios.