Visual textures -- spatially homogeneous image regions containing repeated elements (e.g., a field of grass, the bark of a tree) -- are ubiquitous in visual scenes and provide important cues for recognizing and analyzing materials and objects. A number of existing texture models extract essential statistics from a single texture image, and can then generate high-quality samples that are visually similar to the original by matching these statistics. However, their statistics are either hand-designed or based on a network pretrained for another purpose (e.g., object recognition). Here, we develop the first principled method for unsupervised learning of a set of statistics that are used to constrain a maximum entropy probability model. We leverage methods developed for generative diffusion models to derive training and sampling procedures, and compare these to the traditional method of sampling via matching the statistics. Despite the compactness of our trained model (512 statistics), it generates texture images whose quality is as good as or better than that of the current state-of-the-art model (~177k statistics). A more direct comparison of the two models, obtained by synthesizing images that are indistinguishable for one model but maximally different for the other, reveals their relative strengths and weaknesses. Finally, we show that, unlike previous statistical texture models, a straight trajectory in the representation space of our model generates homogeneous texture samples that interpolate smoothly between the features of the two end points.
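For readers unfamiliar with the term, a maximum entropy model constrained by a set of statistics has a standard exponential-family form. The following is a sketch in generic notation, not the paper's own: the symbols $\phi$ (the vector of statistics), $\mu$ (their target values, e.g. measured from a texture image), $\lambda$ (Lagrange multipliers), and $Z$ (the normalizing constant) are assumptions introduced here for illustration.

```latex
\begin{align}
p^\star &= \arg\max_{p} \; -\int p(x)\,\log p(x)\,dx
  \quad \text{subject to} \quad \mathbb{E}_{x \sim p}\big[\phi(x)\big] = \mu, \\
p^\star(x) &= \frac{1}{Z(\lambda)} \exp\!\big(-\lambda^\top \phi(x)\big),
  \qquad Z(\lambda) = \int \exp\!\big(-\lambda^\top \phi(x)\big)\,dx .
\end{align}
```

In the setting described above, $\phi$ would presumably correspond to the 512 learned statistics, with $\lambda$ chosen so that the constraint on $\mathbb{E}[\phi(x)]$ holds.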