In the domain of vision-language integration, generating detailed image captions remains a significant challenge due to the scarcity of rich, curated datasets. This study introduces PixLore, a novel method that leverages Querying Transformers by fine-tuning the BLIP-2 model with LoRA on a standard commercial GPU. The proposed approach trains on a carefully assembled dataset whose annotations are produced by an ensemble of state-of-the-art Computer Vision models and then combined and augmented by ChatGPT, addressing the question of whether intricate image understanding can be achieved with an ensemble of smaller-scale models, a strategy referred to as Knowledge Stitching. Comparative evaluations against major models such as GPT-4 and Google Bard show that PixLore-2.7B, despite having considerably fewer parameters, is rated higher than existing state-of-the-art models in over half of the assessments. Specifically, PixLore outperforms Bard and BLIP-2, which score approximately 35.18% and 27.98% lower, respectively, on the image captioning task. This research not only presents a groundbreaking approach but also highlights the importance of well-curated datasets in enhancing the performance of smaller models.
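To make the training setup concrete, the following is a minimal sketch of how BLIP-2 can be wrapped with LoRA adapters via Hugging Face's `transformers` and `peft` libraries. The model variant, LoRA rank, scaling factor, and target modules shown here are illustrative assumptions, not the paper's published configuration.

```python
import torch
from transformers import Blip2ForConditionalGeneration, Blip2Processor
from peft import LoraConfig, get_peft_model

# Assumed checkpoint: the 2.7B OPT-based variant, matching PixLore-2.7B.
model_name = "Salesforce/blip2-opt-2.7b"

processor = Blip2Processor.from_pretrained(model_name)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.float16
)

# LoRA adapters on the attention projections keep the trainable parameter
# count small enough to fit fine-tuning on a single commercial GPU.
lora_config = LoraConfig(
    r=16,                                 # assumed adapter rank
    lora_alpha=32,                        # assumed scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed target layers
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable
```

Freezing the base model and training only the low-rank adapters is what makes fine-tuning a billion-parameter captioner feasible on commodity hardware, which is the hardware constraint the abstract emphasizes.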