Weakly supervised object localization (WSOL) remains challenging because localization models must be learned from image category labels alone. Conventional methods that discriminatively train activation models tend to ignore object parts that are representative but less discriminative. In this study, we propose a generative prompt model (GenPromp), which defines the first generative pipeline for localizing less discriminative object parts by formulating WSOL as a conditional image denoising procedure. During training, GenPromp converts image category labels into learnable prompt embeddings, which are fed to a generative model that conditionally recovers the input image from its noised version and thereby learns representative embeddings. During inference, GenPromp combines the representative embeddings with discriminative embeddings (queried from an off-the-shelf vision-language model) to obtain both representative and discriminative capacity. The combined embeddings are finally used to generate multi-scale, high-quality attention maps, which facilitate localizing the full object extent. Experiments on CUB-200-2011 and ILSVRC show that GenPromp outperforms the best discriminative models by 5.2% and 5.6% (Top-1 Loc), respectively, setting a solid baseline for WSOL with generative models. Code is available at https://github.com/callsys/GenPromp.
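The inference steps described above (combining two embeddings, then turning multi-scale attention maps into a localization) can be illustrated with a minimal sketch. It is not the GenPromp implementation. The linear blend with a `weight` parameter, the nearest-neighbour resizing, the averaging of maps, the 0.5 threshold, and all function names are assumptions chosen for illustration. The toy lists stand in for real embeddings and attention maps, which in practice would come from the generative model.

```python
from typing import List, Tuple

Vector = List[float]
Grid = List[List[float]]


def combine_embeddings(representative: Vector, discriminative: Vector,
                       weight: float = 0.5) -> Vector:
    """Blend two prompt embeddings linearly (the blend rule is an assumption)."""
    if len(representative) != len(discriminative):
        raise ValueError("embeddings must have the same dimension")
    return [weight * r + (1.0 - weight) * d
            for r, d in zip(representative, discriminative)]


def upsample_nearest(grid: Grid, size: int) -> Grid:
    """Resize a square map to size x size with nearest-neighbour sampling."""
    n = len(grid)
    return [[grid[(y * n) // size][(x * n) // size] for x in range(size)]
            for y in range(size)]


def aggregate_maps(maps: List[Grid], size: int) -> Grid:
    """Average square attention maps at a common resolution, then min-max normalize."""
    resized = [upsample_nearest(m, size) for m in maps]
    mean = [[sum(m[y][x] for m in resized) / len(resized) for x in range(size)]
            for y in range(size)]
    lo = min(min(row) for row in mean)
    hi = max(max(row) for row in mean)
    scale = (hi - lo) or 1.0  # avoid division by zero on a constant map
    return [[(v - lo) / scale for v in row] for row in mean]


def bounding_box(grid: Grid, threshold: float = 0.5) -> Tuple[int, int, int, int]:
    """Return (x_min, y_min, x_max, y_max) around cells at or above the threshold."""
    cells = [(x, y) for y, row in enumerate(grid)
             for x, v in enumerate(row) if v >= threshold]
    if not cells:
        raise ValueError("no cell reaches the threshold")
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    return min(xs), min(ys), max(xs), max(ys)


if __name__ == "__main__":
    # Toy stand-ins for attention maps at two scales (2x2 and 4x4).
    coarse = [[0.0, 1.0], [0.0, 1.0]]
    fine = [[0.0, 0.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 1.0],
            [0.0, 0.0, 1.0, 1.0],
            [0.0, 0.0, 0.0, 0.0]]
    fused = aggregate_maps([coarse, fine], size=4)
    print(combine_embeddings([1.0, 0.0], [0.0, 1.0], weight=0.5))
    print(bounding_box(fused, threshold=0.6))
```

In this toy setup, lowering the threshold widens the box to include cells that only the coarse map highlights. This loosely mirrors the trade-off between covering the full object extent and keeping only the most strongly activated region.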