By employing massive Mobile AI-Generated Content (AIGC) Service Providers (MASPs) equipped with powerful models, high-quality AIGC services become accessible to resource-constrained end users. However, this advancement, referred to as mobile AIGC, also introduces a significant challenge: users must download large AIGC outputs from the MASPs, leading to substantial bandwidth consumption and potential transmission failures. In this paper, we apply cross-modal Generative Semantic Communications (G-SemCom) to mobile AIGC to overcome wireless bandwidth constraints. Specifically, we utilize a series of cross-modal attention maps to indicate the correlation between user prompts and each part of the AIGC outputs. In this way, the MASP can analyze the prompt context and efficiently filter the most semantically important content. Only this semantic information is transmitted, from which users can recover the entire AIGC output with high quality while saving mobile bandwidth. Since the transmitted information not only preserves the semantics but also prompts the recovery, we formulate a joint semantic encoding and prompt engineering problem to optimize the bandwidth allocation among users. In particular, we present a human-perceptual metric named Joint Perceptual Similarity and Quality (JPSQ), which fuses two learning-based measurements of semantic similarity and aesthetic quality, respectively. Furthermore, we develop the Attention-aware Deep Diffusion (ADD) algorithm, which learns attention maps and leverages the diffusion process to enhance its ability to explore the environment. Extensive experiments demonstrate that our proposal reduces the bandwidth consumption of mobile users by 49.4% on average, with almost no perceptual difference in AIGC output quality. Moreover, the ADD algorithm shows superior performance over baseline DRL methods, achieving a 1.74x higher overall reward.
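To make the attention-based semantic filtering idea concrete, the sketch below keeps only the image regions whose cross-modal attention weights fall in the top fraction of a (hypothetical) prompt-to-image attention map. This is a minimal illustration under assumed inputs, not the paper's actual G-SemCom encoder; the function name, `keep_ratio` parameter, and thresholding rule are all simplifying assumptions.

```python
import numpy as np

def filter_semantic_content(image, attention, keep_ratio=0.25):
    """Zero out pixels whose cross-modal attention weight is below the
    (1 - keep_ratio) quantile, keeping only the most semantically
    important regions. A hypothetical simplification of attention-guided
    semantic encoding."""
    threshold = np.quantile(attention, 1.0 - keep_ratio)
    mask = attention >= threshold
    return image * mask, mask

# Toy usage: an 8x8 "image" and a same-shaped attention map.
rng = np.random.default_rng(0)
img = rng.random((8, 8))
attn = rng.random((8, 8))
filtered, mask = filter_semantic_content(img, attn, keep_ratio=0.25)
```

In a transmission setting, only the retained regions (plus the user prompt) would be sent, and the receiver's generative model would inpaint the suppressed regions to recover the full output.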