With recent advancements in Large Multimodal Models (LMMs) across various domains, a novel prompting method called visual referring prompting has emerged, showing significant potential for enhancing human-computer interaction in multimodal systems. Compared with traditional text descriptions or coordinates, this method gives users a more natural and flexible way to interact with these systems. However, visual referring prompting has not yet been categorized, and its impact on LMM performance has yet to be formally examined. In this study, we conduct the first comprehensive analysis of LMMs under a variety of visual referring prompting strategies. We introduce a benchmark dataset called VRPTEST, comprising 3 different visual tasks and 2,275 images that span diverse combinations of prompt strategies. Using VRPTEST, we evaluate eight versions of prominent open-source and proprietary foundation models, including two early versions of GPT-4V. We develop an automated assessment framework based on software metamorphic testing techniques to evaluate the accuracy of LMMs without human intervention or manual labeling. We find that current proprietary models generally outperform open-source ones, with an average accuracy improvement of 22.70%; however, there is still room for improvement. Moreover, our quantitative analysis shows that the choice of prompt strategy significantly affects the accuracy of LMMs, with variations ranging from -17.5% to +7.3%. Further case studies indicate that an appropriate visual referring prompting strategy can improve LMMs' understanding of context and location information, while an unsuitable one may lead to answer rejection. We also provide insights on minimizing the negative impact of visual referring prompting on LMMs.
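The label-free evaluation idea mentioned above can be illustrated with a small sketch. This is not the VRPTEST framework itself: the abstract does not state which metamorphic relations are used, so the relation below (adding a visual referring prompt to an image should leave the model's answer unchanged) is an assumption. The names `Query`, `toy_model`, `metamorphic_consistency`, and the strategy labels `red_circle` and `mask` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence


@dataclass(frozen=True)
class Query:
    image_id: str                        # stands in for the image content
    question: str
    visual_prompt: Optional[str] = None  # hypothetical strategy label


Model = Callable[[Query], str]


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(answer.strip().lower().split())


def metamorphic_consistency(
    model: Model, seeds: Sequence[Query], strategies: Sequence[str]
) -> Dict[str, float]:
    """Return, per strategy, the fraction of seed queries whose answer is
    unchanged after the visual prompt is added (assumed relation).
    No ground-truth labels are needed: only pairs of outputs are compared."""
    if not seeds:
        raise ValueError("at least one seed query is required")
    rates: Dict[str, float] = {}
    for strategy in strategies:
        agree = 0
        for seed in seeds:
            base = normalize(model(seed))
            follow_up = normalize(
                model(Query(seed.image_id, seed.question, strategy))
            )
            agree += int(base == follow_up)
        rates[strategy] = agree / len(seeds)
    return rates


def toy_model(query: Query) -> str:
    """Stand-in for an LMM: rejects the question under the 'mask' strategy."""
    if query.visual_prompt == "mask":
        return "Sorry, I cannot answer."
    return " Cat "


seeds = [Query("img-1", "What animal is shown?"),
         Query("img-2", "What animal is shown?")]
rates = metamorphic_consistency(toy_model, seeds, ["red_circle", "mask"])
print(rates)
```

A consistency rate below 1.0 flags a strategy under which the model's behavior changes, including the answer-rejection case described in the case studies. A real harness would replace `toy_model` with calls to the model under test and would likely need a more tolerant answer comparison than exact string equality.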