Google's Bard has emerged as a formidable competitor to OpenAI's ChatGPT in the field of conversational AI. Notably, Bard has recently been updated to handle visual inputs alongside text prompts during conversations. Given Bard's impressive track record in handling textual inputs, we explore its capabilities in understanding and interpreting visual data (images) conditioned on text questions. This exploration has the potential to reveal new insights and challenges for Bard and other forthcoming multimodal generative models, especially in addressing complex computer vision problems that demand accurate visual and language understanding. Specifically, in this study, we focus on 15 diverse task scenarios encompassing regular, camouflaged, medical, underwater, and remote sensing data to comprehensively evaluate Bard's performance. Our primary finding is that Bard still struggles in these vision scenarios, highlighting a significant gap in vision-based understanding that needs to be bridged in future developments. We expect that this empirical study will prove valuable in advancing future models, leading to enhanced capabilities in comprehending and interpreting fine-grained visual data. Our project is released at https://github.com/htqin/GoogleBard-VisUnderstand.