Semantic communication (SC) is expected to be a paradigm shift that catalyzes next-generation communication, moving the main concern from accurate bit transmission to effective exchange of semantic information. However, the widely used existing image metrics are not suitable for evaluating image semantic similarity in SC. Classical metrics for the similarity between two images, such as PSNR and MS-SSIM, usually operate at the pixel level or the structural level. Directly applying deep-learning-based metrics tailored by the CV community, such as LPIPS, is also infeasible for SC. To tackle this, inspired by BERTScore from the NLP community, we propose a novel metric for evaluating image semantic similarity, named Vision Transformer Score (ViTScore). We prove theoretically that ViTScore has three important properties, namely symmetry, boundedness, and normalization, which make it convenient and intuitive for image measurement. To evaluate the performance of ViTScore, we compare it with three typical metrics (PSNR, MS-SSIM, and LPIPS) through four classes of experiments: (i) correlation with BERTScore, evaluated through the downstream CV task of image captioning, (ii) evaluation in classical image communications, (iii) evaluation in image semantic communication systems, and (iv) evaluation in image semantic communication systems under semantic attack. Experimental results demonstrate that ViTScore is robust and efficient in evaluating the semantic similarity of images. In particular, ViTScore outperforms the other three metrics in evaluating the semantic changes that semantic attacks cause in images, such as image inversion with Generative Adversarial Networks (GANs). This indicates that ViTScore is an effective performance metric when deployed in SC scenarios.
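The abstract does not spell out how ViTScore is computed, so the following is only a minimal sketch of the BERTScore-style greedy matching that the metric is said to be inspired by, applied to patch embeddings. The function names (`l2_normalize`, `greedy_match_score`), the F1-style combination, and the random arrays standing in for Vision Transformer patch embeddings are all assumptions made for illustration; they are not taken from the paper. The sketch is meant to show why a score of this kind can be symmetric and can equal 1 when an image is compared with itself.

```python
import numpy as np


def l2_normalize(emb):
    """Scale each row (one patch embedding) to unit Euclidean length."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.clip(norms, 1e-12, None)


def greedy_match_score(ref_emb, cand_emb):
    """BERTScore-style greedy matching over two sets of patch embeddings.

    Hypothetical helper: assumes each input is an (n_patches, dim) array.
    Returns (precision, recall, f1), where each reference patch is matched
    to its most similar candidate patch and vice versa.
    """
    ref = l2_normalize(np.asarray(ref_emb, dtype=np.float64))
    cand = l2_normalize(np.asarray(cand_emb, dtype=np.float64))
    sim = ref @ cand.T                   # pairwise cosine similarities
    recall = sim.max(axis=1).mean()      # best match for each reference patch
    precision = sim.max(axis=0).mean()   # best match for each candidate patch
    denom = precision + recall
    f1 = 0.0 if abs(denom) < 1e-12 else 2.0 * precision * recall / denom
    return float(precision), float(recall), float(f1)


# Random stand-ins for patch embeddings (assumption: a real pipeline would
# take these from a Vision Transformer, which is not reproduced here).
rng = np.random.default_rng(0)
patches_a = rng.normal(size=(16, 32))
patches_b = patches_a + 0.1 * rng.normal(size=(16, 32))

p_ab, r_ab, f_ab = greedy_match_score(patches_a, patches_b)
p_ba, r_ba, f_ba = greedy_match_score(patches_b, patches_a)
p_aa, r_aa, f_aa = greedy_match_score(patches_a, patches_a)

print("score(a, b):", f_ab)
print("score(b, a):", f_ba)
print("score(a, a):", f_aa)
```

In this sketch, swapping the two inputs only exchanges precision and recall, so the combined score is unchanged, and comparing a set of embeddings with itself gives a score of 1 up to floating-point error. Precision and recall are each averages of cosine similarities and therefore lie in [-1, 1]. Whether the paper's ViTScore uses exactly this combination is not stated in the abstract.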