Similarity estimation is essential for many game AI applications, from the procedural generation of distinct assets to automated exploration with game-playing agents. While similarity metrics often substitute for human evaluation, their alignment with human judgement is unclear. Consequently, the results of their application can fall short of human expectations, leading, for example, to unappreciated content or unbelievable agent behaviour. We address this gap through a multi-factorial study of two tile-based games in two representations, where participants (N=456) judged the similarity of level triplets. Based on this data, we construct domain-specific perceptual spaces that encode similarity-relevant attributes. We compare 12 metrics to these spaces and evaluate their approximation quality through several quantitative lenses. Moreover, we conduct a qualitative labelling study to identify the features underlying human similarity judgement in this popular genre. Our findings inform the selection of existing metrics and highlight requirements for the design of new similarity metrics benefiting game development and research.
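To make the idea of comparing a metric against human triplet judgements concrete, the following is a minimal sketch and not the procedure used in the study. It assumes, hypothetically, that each judgement records an anchor level, two candidate levels, and which candidate the participant found more similar to the anchor; the abstract does not specify the triplet task format. The tile encoding, the `hamming_distance` metric, and the `triplet_agreement` score are illustrative stand-ins and are not claimed to be among the 12 metrics or the quantitative lenses evaluated.

```python
from typing import Callable, Sequence, Tuple

# Assumed encoding: a level is a sequence of equally long rows of tile characters.
Level = Sequence[str]
# Assumed judgement format: (anchor, first, second, human_choice), where
# human_choice is 0 if `first` was judged more similar to the anchor, else 1.
Judgement = Tuple[Level, Level, Level, int]


def hamming_distance(x: Level, y: Level) -> float:
    """Fraction of tile positions at which two equally sized levels differ."""
    tiles_x = "".join(x)
    tiles_y = "".join(y)
    if len(tiles_x) != len(tiles_y):
        raise ValueError("levels must have the same dimensions")
    return sum(a != b for a, b in zip(tiles_x, tiles_y)) / len(tiles_x)


def triplet_agreement(
    judgements: Sequence[Judgement],
    distance: Callable[[Level, Level], float],
) -> float:
    """Share of triplets where the metric picks the same candidate as the human.

    Ties in the metric count as disagreement.
    """
    if not judgements:
        raise ValueError("at least one judgement is required")
    agree = 0
    for anchor, first, second, human_choice in judgements:
        d_first = distance(anchor, first)
        d_second = distance(anchor, second)
        if d_first == d_second:
            continue  # the metric cannot decide, so it does not agree
        metric_choice = 0 if d_first < d_second else 1
        agree += int(metric_choice == human_choice)
    return agree / len(judgements)


# Toy data, invented purely for illustration.
anchor = ["##..", "#..."]
near = ["##..", "#..#"]
far = ["....", ".###"]
judgements = [
    (anchor, near, far, 0),
    (anchor, far, near, 1),
    (anchor, near, far, 1),  # a participant who disagrees with the metric
]
score = triplet_agreement(judgements, hamming_distance)
```

Any distance function over levels can be passed in place of `hamming_distance`, which allows several candidate metrics to be scored against the same set of judgements.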