The evolution from text to visual components facilitates people's daily lives, for example by generating images and videos from text and by identifying desired elements within images. Earlier computer vision models with multimodal abilities focused on image detection and classification based on well-defined objects. Large language models (LLMs) introduce the transformation from natural language to visual objects, presenting a visual layout for textual contexts. OpenAI's GPT-4 has emerged as the pinnacle of LLMs, while the computer vision (CV) domain boasts a plethora of state-of-the-art (SOTA) models and algorithms for converting 2D images into 3D representations. However, a mismatch between the algorithm and the problem can lead to undesired results. In response to this challenge, we propose a unified VisionGPT-3D framework that consolidates state-of-the-art vision models, thereby facilitating the development of vision-oriented AI. VisionGPT-3D provides a versatile multimodal framework that builds on the strengths of multimodal foundation models. It seamlessly integrates various SOTA vision models, automates the selection among them, identifies suitable 3D mesh creation algorithms based on the analysis of 2D depth maps, and generates optimal results from diverse multimodal inputs such as text prompts.

Keywords: VisionGPT-3D, 3D vision understanding, Multimodal agent