This paper proposes Pix2Next, a novel image-to-image translation framework designed to address the challenge of generating high-quality Near-Infrared (NIR) images from RGB inputs. Our approach leverages a state-of-the-art Vision Foundation Model (VFM) within an encoder-decoder architecture, incorporating cross-attention mechanisms to enhance feature integration. This design captures detailed global representations and preserves essential spectral characteristics, treating RGB-to-NIR translation as more than a simple domain transfer problem. A multi-scale PatchGAN discriminator enforces realistic image generation at multiple levels of detail, while carefully designed loss functions couple global context understanding with local feature preservation. Experiments on the RANUS dataset demonstrate Pix2Next's advantages in both quantitative metrics and visual quality, improving the FID score by 34.81% over existing methods. Furthermore, we demonstrate the practical utility of Pix2Next by showing improved performance on a downstream object detection task when generated NIR data are used to augment limited real NIR datasets. The proposed approach enables scaling up NIR datasets without additional data acquisition or annotation effort, potentially accelerating progress in NIR-based computer vision applications.
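To make the cross-attention feature integration concrete, the sketch below shows one plausible form of the fusion step: decoder features act as queries attending over VFM features, so the generator can pull global context from the foundation model at each decoding stage. This is a minimal PyTorch illustration under assumed shapes; `CrossAttentionFusion`, `dim=256`, and `num_heads=8` are illustrative names and values, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse decoder features (queries) with VFM features (keys/values)
    via multi-head cross-attention. Dimensions are assumptions."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, decoder_tokens, vfm_tokens):
        # decoder_tokens: (B, N_dec, dim); vfm_tokens: (B, N_vfm, dim)
        attended, _ = self.attn(query=decoder_tokens,
                                key=vfm_tokens,
                                value=vfm_tokens)
        # residual connection plus normalization keeps the decoder's
        # local features while injecting global VFM context
        return self.norm(decoder_tokens + attended)

# usage with dummy token maps
fusion = CrossAttentionFusion()
dec = torch.randn(2, 1024, 256)   # flattened decoder feature map
vfm = torch.randn(2, 1024, 256)   # flattened VFM feature map
out = fusion(dec, vfm)            # (2, 1024, 256)
```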
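Similarly, the multi-scale PatchGAN discriminator can be sketched as identical patch-level discriminators applied to progressively downsampled inputs, so realism is judged at several detail levels. Channel counts, layer depth, and the number of scales below are assumptions in the style of pix2pix/pix2pixHD, not the paper's exact settings; in practice the input may be the generated NIR image alone or concatenated with its RGB source, so `in_ch` is left configurable.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """PatchGAN head: outputs a grid of real/fake logits, one per
    receptive-field patch (a sketch, not the paper's configuration)."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        layers, ch = [], base
        layers += [nn.Conv2d(in_ch, ch, 4, stride=2, padding=1),
                   nn.LeakyReLU(0.2, inplace=True)]
        for _ in range(2):  # assumed depth
            layers += [nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1),
                       nn.InstanceNorm2d(ch * 2),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch *= 2
        layers += [nn.Conv2d(ch, 1, 4, stride=1, padding=1)]  # patch logits
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class MultiScaleDiscriminator(nn.Module):
    """Run a PatchGAN on each of several downsampled copies of the
    input so fine and coarse structure are both assessed."""
    def __init__(self, in_ch=3, num_scales=3):
        super().__init__()
        self.discs = nn.ModuleList(
            PatchDiscriminator(in_ch) for _ in range(num_scales))
        self.down = nn.AvgPool2d(3, stride=2, padding=1,
                                 count_include_pad=False)

    def forward(self, x):
        outputs = []
        for d in self.discs:
            outputs.append(d(x))  # patch-logit map at this scale
            x = self.down(x)      # halve resolution for the next scale
        return outputs

# usage: score a dummy 3-channel image at three scales
msd = MultiScaleDiscriminator()
logit_maps = msd(torch.randn(1, 3, 256, 256))  # list of 3 logit grids
```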