Quantifying the degree of similarity between images is a key copyright issue for image-based machine learning. In legal doctrine, however, determining the degree of similarity between works requires subjective analysis, and fact-finders (judges and juries) can show considerable variability in these subjective judgment calls. Images that are structurally similar can be deemed dissimilar, whereas images of completely different scenes can be deemed similar enough to support a claim of copying. We seek to define and compute a notion of "conceptual similarity" among images that captures high-level relations even among images that do not share repeated elements or visually similar components. The idea is to use a base multi-modal model to generate "explanations" (captions) of visual data at increasing levels of complexity. Similarity can then be measured by the length of the caption needed to discriminate between the two images: two highly dissimilar images can be discriminated early in their description, whereas conceptually similar ones will need more detail to be distinguished. We operationalize this definition and show that it correlates with subjective assessment (averaged human evaluations) and outperforms existing baselines on both image-to-image and text-to-text similarity benchmarks. Beyond providing a single number, our method also offers interpretability by pointing to the specific level of granularity of the description at which the source data are differentiated.
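The caption-length idea described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not from the paper: captions are supplied by hand as lists ordered from coarse to fine (in the paper they come from a multi-modal model), two captions "discriminate" the images when they differ after whitespace and case normalization, and the score is the fraction of levels the two images share before they first diverge. The names `first_discriminating_level` and `conceptual_similarity` are hypothetical.

```python
from typing import Callable, Optional, Sequence


def captions_differ(a: str, b: str) -> bool:
    """Toy discriminator (an assumption): captions differ after normalization."""
    return " ".join(a.lower().split()) != " ".join(b.lower().split())


def first_discriminating_level(
    caps_a: Sequence[str],
    caps_b: Sequence[str],
    differ: Callable[[str, str], bool] = captions_differ,
) -> Optional[int]:
    """Return the 1-based level at which the captions first tell the images apart.

    Captions are ordered from coarse to fine. Returns None when no compared
    level discriminates between the two images.
    """
    for level, (a, b) in enumerate(zip(caps_a, caps_b), start=1):
        if differ(a, b):
            return level
    return None


def conceptual_similarity(
    caps_a: Sequence[str],
    caps_b: Sequence[str],
    differ: Callable[[str, str], bool] = captions_differ,
) -> float:
    """Score in [0, 1]: fraction of levels shared before the first divergence.

    This scoring rule is an illustrative assumption, not the paper's formula.
    """
    depth = min(len(caps_a), len(caps_b))
    if depth == 0:
        raise ValueError("each image needs at least one caption level")
    level = first_discriminating_level(caps_a, caps_b, differ)
    if level is None:
        return 1.0  # never distinguished at the available granularity
    return (level - 1) / depth


# Hand-written caption hierarchies standing in for model output.
photo_a = ["a landscape", "a beach at sunset", "a beach at sunset with a lone surfer"]
photo_b = ["a landscape", "a beach at sunset", "a beach at sunset with two dogs playing"]
photo_c = ["a portrait", "a portrait of a chef", "a portrait of a chef holding a knife"]

print(first_discriminating_level(photo_a, photo_b))  # level where a and b diverge
print(conceptual_similarity(photo_a, photo_b))
print(conceptual_similarity(photo_a, photo_c))
```

Returning the discriminating level alongside the score mirrors the interpretability point in the abstract: the level itself indicates how much descriptive detail was needed before the two images could be told apart.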