The evolution from text to visual components facilitates people's daily lives, for example by generating images and videos from text and by identifying desired elements within images. Earlier computer vision models with multimodal abilities focused on image detection and classification based on well-defined objects. Large language models (LLMs) introduce the transformation from natural language to visual objects, presenting a visual layout for textual contexts. OpenAI GPT-4 has emerged as the pinnacle of LLMs, while the computer vision (CV) domain boasts a plethora of state-of-the-art (SOTA) models and algorithms for converting 2D images into 3D representations. However, a mismatch between the chosen algorithm and the problem can lead to undesired results. In response to this challenge, we propose a unified VisionGPT-3D framework that consolidates state-of-the-art vision models, thereby facilitating the development of vision-oriented AI. VisionGPT-3D provides a versatile multimodal framework that builds upon the strengths of multimodal foundation models. It seamlessly integrates various SOTA vision models, automates the selection among them, identifies suitable 3D mesh creation algorithms based on analysis of 2D depth maps, and generates optimal results from diverse multimodal inputs such as text prompts.

Keywords: VisionGPT-3D, 3D vision understanding, Multimodal agent