Robotic vision applications often require a wide range of visual perception tasks, such as object detection, segmentation, and identification. Although each of these tasks has seen substantial advances, integrating specialized models into a unified vision pipeline presents significant engineering challenges and costs. Recently, Multimodal Large Language Models (MLLMs) have emerged as novel backbones for various downstream tasks. We argue that leveraging the pre-trained capabilities of MLLMs enables a simplified framework, reducing the need for task-specific encoders. Specifically, the large-scale pre-trained knowledge in MLLMs makes fine-tuning to downstream robotic vision tasks easier and yields superior performance. We introduce the RoboLLM framework, equipped with a BEiT-3 backbone, to address all visual perception tasks in the ARMBench challenge, a large-scale robotic manipulation dataset of real-world warehouse scenarios. RoboLLM not only outperforms existing baselines but also substantially reduces the engineering burden associated with model selection and tuning. The source code is publicly available at https://github.com/longkukuhi/armbench.