While image captioning has achieved impressive performance, the limited diversity of generated captions and the large parameter scale of existing models remain major barriers to real-world application. In this work, we propose Prefix-diffusion, a lightweight image captioning network combined with continuous diffusion. To achieve diversity, we design an efficient method that injects prefix image embeddings into the denoising process of the diffusion model. To reduce the number of trainable parameters, we employ a pre-trained model to extract image features and design an additional mapping network. Benefiting from the generative capabilities of the diffusion model, Prefix-diffusion generates diverse captions with relatively few parameters while maintaining their fluency and relevance. Our work paves the way for scaling up diffusion models for image captioning and achieves promising performance compared with recent approaches.
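To make the prefix-injection idea concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes the mapping network is a single linear layer whose output is reshaped into a few prefix vectors, that noise is added with the standard continuous-diffusion forward step x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε, and that injection amounts to placing the clean prefix ahead of the noisy caption embeddings. All names (`map_to_prefix`, `add_noise`, `build_denoiser_input`) and all dimensions are hypothetical, and the random image feature stands in for the output of a frozen pre-trained encoder.

```python
import math
import random


def linear(x, weight, bias):
    """Compute y = W x + b for one vector x, using plain lists of floats."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def map_to_prefix(image_feature, weight, bias, prefix_len, embed_dim):
    """Hypothetical mapping network: one linear layer, reshaped into prefix_len vectors."""
    flat = linear(image_feature, weight, bias)
    return [flat[i * embed_dim:(i + 1) * embed_dim] for i in range(prefix_len)]


def add_noise(embeddings, alpha_bar, rng):
    """Forward diffusion step: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [[a * v + b * rng.gauss(0.0, 1.0) for v in token] for token in embeddings]


def build_denoiser_input(prefix, noisy_caption):
    """Assumed injection scheme: clean image prefix vectors precede the noisy caption vectors."""
    return prefix + noisy_caption


# Toy dimensions, chosen only for illustration.
rng = random.Random(0)
feat_dim, embed_dim, prefix_len, caption_len = 8, 4, 2, 5

# Stand-in for a feature vector from a frozen pre-trained image encoder.
image_feature = [rng.gauss(0.0, 1.0) for _ in range(feat_dim)]

# The mapping network's weights would be the trainable part; random here.
weight = [[rng.gauss(0.0, 0.1) for _ in range(feat_dim)]
          for _ in range(prefix_len * embed_dim)]
bias = [0.0] * (prefix_len * embed_dim)

# Stand-in for the clean caption token embeddings x_0.
caption = [[rng.gauss(0.0, 1.0) for _ in range(embed_dim)]
           for _ in range(caption_len)]

prefix = map_to_prefix(image_feature, weight, bias, prefix_len, embed_dim)
noisy_caption = add_noise(caption, alpha_bar=0.5, rng=rng)
sequence = build_denoiser_input(prefix, noisy_caption)
```

In this sketch only the mapping weights would be trained alongside the denoiser, which mirrors the abstract's stated goal of keeping trainable parameters low. The denoiser itself is left out because the abstract does not describe it.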