We introduce Olympus, a new approach that transforms Multimodal Large Language Models (MLLMs) into a unified framework capable of handling a wide array of computer vision tasks. Utilizing a controller MLLM, Olympus delegates over 20 specialized tasks across images, videos, and 3D objects to dedicated modules. This instruction-based routing enables complex workflows through chained actions without the need for training heavy generative models. Olympus easily integrates with existing MLLMs, expanding their capabilities with comparable performance. Experimental results demonstrate that Olympus achieves an average routing accuracy of 94.75% across 20 tasks and precision of 91.82% in chained action scenarios, showcasing its effectiveness as a universal task router that can solve a diverse range of computer vision tasks. Project page: http://yuanze-lin.me/Olympus_page/