Recently, the intersection of Large Language Models (LLMs) and Computer Vision (CV) has emerged as a pivotal area of research, driving significant advancements in Artificial Intelligence (AI). As transformers have become the backbone of many state-of-the-art models in both Natural Language Processing (NLP) and CV, understanding their evolution and potential enhancements is crucial. This survey examines recent progress in transformers and their successors, emphasizing their potential to revolutionize Vision Transformers (ViTs) and LLMs. It also presents a comparative analysis of the performance metrics of several leading paid and open-source LLMs, shedding light on their strengths and areas for improvement, together with a literature review of how LLMs are being used to tackle vision-related tasks. Furthermore, the survey presents a comprehensive collection of datasets used to train LLMs, offering insight into the diverse data available for achieving high performance in various pre-training and downstream tasks. The survey concludes by highlighting open directions in the field and suggesting potential avenues for future research and development. Overall, it aims to underscore the profound intersection of LLMs and CV, an intersection that is leading to a new era of integrated and advanced AI models.