Recent advances in generative AI suggest that, given a visual prompt, GPT-4V can demonstrate significant proficiency in image recognition tasks. Despite these impressive capabilities, the financial cost of GPT-4V inference presents a substantial barrier to its wide use. To address this challenge, our work introduces Collage Prompting, a budget-friendly prompting approach that concatenates multiple images into a single visual input. With a collage prompt, GPT-4V can perform image recognition on several images simultaneously. Based on the observation that the accuracy of GPT-4V's image recognition varies significantly with the order of images within the collage prompt, our method further learns to optimize the arrangement of images for maximum recognition accuracy. We train a graph predictor to estimate the accuracy of each collage prompt, and we propose an optimization method to navigate the search space of possible image arrangements. Experimental results across various datasets demonstrate that the cost-efficiency score of the collage prompt is much higher than that of the standard prompt. Additionally, a collage prompt with a learned arrangement achieves clearly better accuracy than a collage prompt with a random arrangement in GPT-4V's visual recognition.
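As a rough illustration of the two ideas in the abstract, the sketch below tiles several equal-sized images into one grid-shaped collage and then searches over arrangements for the one with the highest predicted accuracy. Everything here is an assumption made for illustration: the names `make_collage` and `search_arrangement` are hypothetical, images are plain nested lists of pixel values, `predict_accuracy` is a placeholder for the trained graph predictor, and plain random search stands in for the paper's optimization method, which the abstract does not specify.

```python
import random


def make_collage(images, order, grid):
    """Tile equal-sized images into a grid x grid collage.

    images: list of 2-D pixel arrays (lists of rows), all the same size.
    order:  order[p] is the index of the image placed at grid position p,
            with positions numbered row by row.
    """
    if len(order) != grid * grid:
        raise ValueError("order must list exactly grid * grid image indices")
    height = len(images[0])
    collage = []
    for grid_row in range(grid):
        # Images that sit side by side in this row of the grid.
        tiles = [images[order[grid_row * grid + col]] for col in range(grid)]
        for y in range(height):
            row = []
            for tile in tiles:
                row.extend(tile[y])  # concatenate pixel rows horizontally
            collage.append(row)
    return collage


def search_arrangement(n_images, predict_accuracy, n_trials=100, seed=0):
    """Random search over arrangements (a stand-in, not the paper's method).

    predict_accuracy: hypothetical callable mapping an arrangement (a list
    of image indices) to an estimated recognition accuracy.
    """
    rng = random.Random(seed)
    best_order = list(range(n_images))
    best_score = predict_accuracy(best_order)
    for _ in range(n_trials):
        candidate = list(range(n_images))
        rng.shuffle(candidate)
        score = predict_accuracy(candidate)
        if score > best_score:
            best_order, best_score = candidate, score
    return best_order, best_score
```

In this sketch, four images sent as one 2 x 2 collage would need a single model call instead of four, which is where the cost saving described above would come from. The returned arrangement can be passed directly as `order` to `make_collage`.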