Large Vision-Language Models (LVLMs) are gaining traction for their remarkable ability to process and integrate visual and textual data. Despite their popularity, the capacity of LVLMs to generate precise, fine-grained textual descriptions has not been fully explored. This study addresses this gap by focusing on \textit{distinctiveness} and \textit{fidelity}, assessing how well models such as Open-Flamingo, IDEFICS, and MiniGPT-4 distinguish between similar objects and accurately describe visual features. We propose the Textual Retrieval-Augmented Classification (TRAC) framework, which leverages the models' generative capabilities to enable a deeper analysis of fine-grained visual description generation. This research provides valuable insights into the generation quality of LVLMs, enhancing the understanding of multimodal language models. Notably, MiniGPT-4 outperforms the other two models in generating fine-grained descriptions. The code is provided at \url{https://anonymous.4open.science/r/Explore_FGVDs-E277}.