The integration of language instructions with robotic control, particularly through Vision-Language-Action (VLA) models, has shown significant potential. However, these systems are often hindered by high computational costs, the need for extensive retraining, and limited scalability, making them less accessible for widespread use. In this paper, we introduce SVLR (Scalable Visual Language Robotics), an open-source, modular framework that operates without retraining, providing a scalable solution for robotic control. SVLR leverages a combination of lightweight, open-source AI models to process visual and language inputs: the Vision-Language Model (VLM) Mini-InternVL, the zero-shot image-segmentation model CLIPSeg, the Large Language Model (LLM) Phi-3, and the sentence-similarity model all-MiniLM. These models work together to identify objects in an unknown environment, use them as parameters for task execution, and generate a sequence of actions in response to natural language instructions. A key strength of SVLR is its scalability. The framework allows new robotic tasks and robots to be integrated simply by adding text descriptions and task definitions, with no retraining required. This modularity ensures that SVLR can continuously adapt to the latest advancements in AI technologies and support a wide range of robots and tasks. SVLR runs effectively on an NVIDIA RTX 2070 (mobile) GPU, demonstrating promising performance in executing pick-and-place tasks. While these initial results are encouraging, further evaluation across a broader set of tasks, along with comparisons against existing VLA models, is needed to assess SVLR's generalization capabilities and performance in more complex scenarios.
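The scalability mechanism described above (registering tasks as text descriptions and matching them to instructions via sentence similarity) can be sketched minimally as follows. This is an illustrative sketch, not the paper's implementation: the task names, descriptions, and action primitives are hypothetical, and a toy bag-of-words embedding stands in for all-MiniLM to keep the example self-contained.

```python
# Minimal sketch of an SVLR-style task registry with similarity-based matching.
# Assumptions: task entries and primitives are illustrative; a bag-of-words
# embedding replaces all-MiniLM so the sketch runs without model downloads.
import math
from collections import Counter

# New tasks are added by registering a text description and an action sequence;
# no retraining is involved.
TASKS = {
    "pick_and_place": {
        "description": "pick up an object and place it at a target location",
        "actions": ["move_above(object)", "grasp(object)",
                    "move_above(target)", "release()"],
    },
    "push": {
        "description": "push an object toward a target location",
        "actions": ["move_behind(object)", "push_toward(target)"],
    },
}

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; SVLR uses all-MiniLM sentence embeddings."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_task(instruction: str) -> str:
    """Return the registered task whose description best matches the instruction."""
    query = embed(instruction)
    return max(TASKS, key=lambda name: cosine(query, embed(TASKS[name]["description"])))
```

Replacing `embed` with a real sentence encoder (and having a VLM fill in the `object`/`target` parameters from the scene) is what lets new tasks be added purely by editing the registry.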