Recognizing visual entities in a natural language sentence and arranging them in a 2D spatial layout require a compositional understanding of language and space. This task of layout prediction is valuable in text-to-image synthesis as it enables localized and controlled in-painting of the image. This comparative study shows that layouts can be predicted from language representations that implicitly or explicitly encode sentence syntax, provided the sentences mention entity relationships similar to those seen during training. To test compositional understanding, we collect a test set of grammatically correct sentences and layouts describing compositions of entities and relations that are unlikely to have been seen during training. Performance on this test set drops substantially, showing that current models rely on correlations in the training data and have difficulty understanding the structure of the input sentences. We propose a novel structural loss function that better enforces the syntactic structure of the input sentence and show large performance gains in the task of 2D spatial layout prediction conditioned on text. The loss could also be applied to other generation tasks where a tree-like structure underlies the conditioning modality. Code, trained models, and the USCOCO evaluation set will be made available via GitHub.
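The abstract does not specify the form of the structural loss. Purely as an illustrative sketch, and not the authors' actual method, the snippet below shows one common way to inject tree structure into a representation: a structural-probe-style objective (in the spirit of Hewitt and Manning, 2019) that regresses pairwise squared L2 distances between token embeddings toward pairwise distances in the sentence's parse tree. The function name, tensor shapes, and the choice of tree distance are all assumptions for illustration.

```python
import torch


def structural_loss(embeddings: torch.Tensor, tree_dist: torch.Tensor) -> torch.Tensor:
    """Illustrative tree-distance loss (an assumption, not the paper's loss).

    embeddings: (n, d) contextual token representations for one sentence
    tree_dist:  (n, n) pairwise node distances in the sentence's parse tree
    """
    diff = embeddings.unsqueeze(0) - embeddings.unsqueeze(1)  # (n, n, d) pairwise differences
    pred_dist = diff.pow(2).sum(dim=-1)                       # (n, n) squared L2 distances
    return (pred_dist - tree_dist).abs().mean()               # L1 gap to tree distances


# Toy usage: a 4-token "chain" parse, where tree distance is |i - j|.
emb = torch.randn(4, 16, requires_grad=True)
d = torch.tensor([[0., 1., 2., 3.],
                  [1., 0., 1., 2.],
                  [2., 1., 0., 1.],
                  [3., 2., 1., 0.]])
loss = structural_loss(emb, d)
loss.backward()  # gradients flow back into the embeddings
```

Under this kind of objective, representations that place syntactically close tokens close together in embedding space incur a lower loss, which is one plausible way a "tree-like structure" could be enforced on the conditioning modality.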