We introduce Olympus, a new approach that transforms Multimodal Large Language Models (MLLMs) into a unified framework capable of handling a wide array of computer vision tasks. Utilizing a controller MLLM, Olympus delegates over 20 specialized tasks across images, videos, and 3D objects to dedicated modules. This instruction-based routing enables complex workflows through chained actions without the need for training heavy generative models. Olympus integrates easily with existing MLLMs, expanding their capabilities while maintaining comparable performance. Experimental results demonstrate that Olympus achieves an average routing accuracy of 94.75% across 20 tasks and a precision of 91.82% in chained action scenarios, showcasing its effectiveness as a universal task router that can solve a diverse range of computer vision tasks. Project page: https://github.com/yuanze-lin/Olympus_page
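The routing idea sketched in the abstract can be illustrated with a toy example. This is a minimal hypothetical sketch, not the paper's actual implementation: the routing-token names, phrase-matching controller, and module stubs below are all illustrative assumptions, standing in for the controller MLLM that maps an instruction to task-specific routing tokens and dispatches each to a dedicated module, including chained actions from a single instruction.

```python
# Hypothetical sketch of instruction-based routing (illustrative names only;
# in Olympus the controller is an MLLM, not a keyword matcher).

# Map instruction phrases to task-specific routing tokens (assumed tokens).
ROUTES = {
    "generate an image": "<image_gen>",
    "edit the image": "<image_edit>",
    "make a video": "<video_gen>",
}

# Stand-ins for the dedicated task modules each token dispatches to.
MODULES = {
    "<image_gen>": lambda prompt: f"[image generated for: {prompt}]",
    "<image_edit>": lambda prompt: f"[image edited per: {prompt}]",
    "<video_gen>": lambda prompt: f"[video generated for: {prompt}]",
}


def route(instruction: str) -> list[str]:
    """Toy controller: emit a routing token for every matched sub-task,
    so one instruction can trigger a chain of actions."""
    text = instruction.lower()
    return [token for phrase, token in ROUTES.items() if phrase in text]


def execute(instruction: str) -> list[str]:
    """Dispatch each routed token to its module in order."""
    return [MODULES[token](instruction) for token in route(instruction)]
```

A chained instruction such as "Generate an image of a cat, then edit the image" would route to both `<image_gen>` and `<image_edit>` and execute the corresponding modules in sequence; the controller itself never runs a heavy generative model, it only emits routing decisions.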