Self-supervised Photographic Image Layout Representation Learning

Image layout representation learning, which encodes an image's layout as a compact vector, is increasingly important for applications such as image retrieval, manipulation, and generation. Most existing approaches rely heavily on costly labeled datasets, and their modeling and learning methods are not adapted to the specific characteristics of photographic image layouts, which makes learning such layouts suboptimal. We address both problems. We define basic layout primitives that encapsulate layout information at various levels and map them, together with their interconnections, onto a heterogeneous graph that explicitly captures layout information in the pixel domain. We then introduce novel pretext tasks, paired with customized loss functions, for self-supervised learning on these layout graphs. On this basis, we develop an autoencoder-based network architecture that compresses the heterogeneous layout graphs into precise, low-dimensional layout representations. We also introduce the LODB dataset, which covers a broader range of layout categories with richer semantics and serves as a benchmark for evaluating layout representation learning methods. Extensive experiments on this dataset demonstrate the superior performance of our approach on photographic image layout representation learning.
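
To make the notion of a heterogeneous layout graph concrete, the following is a minimal illustrative sketch of typed nodes and typed edges in plain Python. It is not the data structure used in this work: the class `HeteroGraph`, the node types `image` and `region`, the relations `adjacent_to` and `part_of`, and the bounding-box features are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class HeteroGraph:
    """Graph with typed nodes and typed edges (all type names are hypothetical)."""
    # node_type -> list of feature vectors
    nodes: dict = field(default_factory=lambda: defaultdict(list))
    # (src_type, relation, dst_type) -> list of (src_index, dst_index)
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_node(self, node_type, features):
        """Store a node's features and return its index within its type."""
        self.nodes[node_type].append(list(features))
        return len(self.nodes[node_type]) - 1

    def add_edge(self, src_type, src, relation, dst_type, dst):
        """Connect two existing nodes with a typed relation."""
        if not 0 <= src < len(self.nodes.get(src_type, [])):
            raise IndexError(f"no {src_type} node with index {src}")
        if not 0 <= dst < len(self.nodes.get(dst_type, [])):
            raise IndexError(f"no {dst_type} node with index {dst}")
        self.edges[(src_type, relation, dst_type)].append((src, dst))


def build_example():
    """Build a toy layout: one image split into a sky and a land region."""
    g = HeteroGraph()
    img = g.add_node("image", [1.0, 1.0])               # assumed features: width, height
    sky = g.add_node("region", [0.0, 0.0, 1.0, 0.4])    # assumed features: x, y, w, h
    land = g.add_node("region", [0.0, 0.4, 1.0, 0.6])
    g.add_edge("region", sky, "adjacent_to", "region", land)
    g.add_edge("region", sky, "part_of", "image", img)
    g.add_edge("region", land, "part_of", "image", img)
    return g


if __name__ == "__main__":
    graph = build_example()
    for edge_type, pairs in graph.edges.items():
        print(edge_type, pairs)
```

Keying edges by a `(source type, relation, target type)` triple keeps each relation separate, so a graph encoder could in principle treat each relation with its own parameters; whether the architecture described here does so is not stated in this abstract.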
