The integration of language instructions with robotic control, particularly through Vision-Language-Action (VLA) models, has shown significant potential. However, these systems are often hindered by high computational costs, the need for extensive retraining, and limited scalability, making them less accessible for widespread use. In this paper, we introduce SVLR (Scalable Visual Language Robotics), an open-source, modular framework that operates without retraining, providing a scalable solution for robotic control. SVLR leverages a combination of lightweight, open-source AI models to process visual and language inputs: the Vision-Language Model (VLM) Mini-InternVL, the zero-shot image-segmentation model CLIPSeg, the Large Language Model (LLM) Phi-3, and the sentence-similarity model all-MiniLM. These models work together to identify objects in an unknown environment, use them as parameters for task execution, and generate a sequence of actions in response to natural language instructions. A key strength of SVLR is its scalability. The framework allows new robotic tasks and robots to be integrated simply by adding text descriptions and task definitions, with no retraining required. This modularity ensures that SVLR can continuously adapt to the latest advancements in AI technologies and support a wide range of robots and tasks. SVLR runs effectively on an NVIDIA RTX 2070 (mobile) GPU, demonstrating promising performance in executing pick-and-place tasks. While these initial results are encouraging, further evaluation across a broader set of tasks, along with comparisons against existing VLA models, is needed to assess SVLR's generalization capabilities and performance in more complex scenarios.
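The scalability mechanism described above (registering tasks as text descriptions and matching them to instructions via sentence similarity) can be sketched minimally as follows. This is an illustrative sketch, not the paper's implementation: the task names, descriptions, and action primitives are hypothetical, and a toy bag-of-words embedding stands in for all-MiniLM to keep the example self-contained.

```python
# Minimal sketch of an SVLR-style task registry with similarity-based matching.
# Assumptions: task entries and primitives are illustrative; a bag-of-words
# embedding replaces all-MiniLM so the sketch runs without model downloads.
import math
from collections import Counter

# New tasks are added by registering a text description and an action sequence;
# no retraining is involved.
TASKS = {
    "pick_and_place": {
        "description": "pick up an object and place it at a target location",
        "actions": ["move_above(object)", "grasp(object)",
                    "move_above(target)", "release()"],
    },
    "push": {
        "description": "push an object toward a target location",
        "actions": ["move_behind(object)", "push_toward(target)"],
    },
}

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; SVLR uses all-MiniLM sentence embeddings."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_task(instruction: str) -> str:
    """Return the registered task whose description best matches the instruction."""
    query = embed(instruction)
    return max(TASKS, key=lambda name: cosine(query, embed(TASKS[name]["description"])))
```

Replacing `embed` with a real sentence encoder (and having a VLM fill in the `object`/`target` parameters from the scene) is what lets new tasks be added purely by editing the registry.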