Robots that follow natural-language instructions often either plan at a high level using hand-designed interfaces or rely on large end-to-end models that are difficult to deploy for real-time control. We propose TeNet (Text-to-Network), a framework for instantiating compact, task-specific robot policies directly from natural-language descriptions. TeNet conditions a hypernetwork on text embeddings produced by a pretrained large language model (LLM) to generate a fully executable policy, which then operates solely on low-dimensional state inputs at high control frequencies. By using language only once, at policy instantiation time, TeNet inherits the general knowledge and paraphrasing robustness of pretrained LLMs while remaining lightweight and efficient at execution time. To improve generalization, we optionally ground language in behavior during training by aligning text embeddings with demonstrated actions, while requiring no demonstrations at inference time. Experiments on MuJoCo and Meta-World benchmarks show that TeNet produces policies that are orders of magnitude smaller than sequence-based baselines, while achieving strong performance in both multi-task and meta-learning settings and supporting high-frequency control. These results show that text-conditioned hypernetworks offer a practical way to build compact, language-driven controllers for resource-constrained robot control tasks with real-time requirements.
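To make the text-to-network idea concrete, the following is a minimal sketch of a hypernetwork that maps a text embedding to the weights of a small state-based policy. It is not the authors' implementation: the layer sizes, the single linear hypernetwork layer, the one-hidden-layer policy, the function names (`init_hypernetwork`, `generate_policy`, `policy_act`), and the random vector standing in for an LLM text embedding are all assumptions chosen for illustration, and the hypernetwork here is untrained.

```python
import math
import random

# Hypothetical sizes, chosen only for illustration.
STATE_DIM, HIDDEN_DIM, ACTION_DIM = 4, 8, 2
EMBED_DIM = 16

# Total number of parameters in the generated one-hidden-layer policy.
N_PARAMS = (STATE_DIM * HIDDEN_DIM + HIDDEN_DIM
            + HIDDEN_DIM * ACTION_DIM + ACTION_DIM)


def init_hypernetwork(rng):
    """Untrained linear hypernetwork: one weight row per policy parameter."""
    return [[rng.gauss(0.0, 0.1) for _ in range(EMBED_DIM)]
            for _ in range(N_PARAMS)]


def generate_policy(hyper, text_embedding):
    """Map a text embedding to policy parameters (run once per task)."""
    flat = [sum(w * e for w, e in zip(row, text_embedding)) for row in hyper]
    it = iter(flat)
    # Slice the flat parameter vector into the policy's weights and biases.
    w1 = [[next(it) for _ in range(STATE_DIM)] for _ in range(HIDDEN_DIM)]
    b1 = [next(it) for _ in range(HIDDEN_DIM)]
    w2 = [[next(it) for _ in range(HIDDEN_DIM)] for _ in range(ACTION_DIM)]
    b2 = [next(it) for _ in range(ACTION_DIM)]
    return w1, b1, w2, b2


def policy_act(policy, state):
    """One control step: depends only on the state and generated weights."""
    w1, b1, w2, b2 = policy
    hidden = [math.tanh(sum(w * s for w, s in zip(row, state)) + b)
              for row, b in zip(w1, b1)]
    return [math.tanh(sum(w * h for w, h in zip(row, hidden)) + b)
            for row, b in zip(w2, b2)]


rng = random.Random(0)
hyper = init_hypernetwork(rng)
# Stand-in for an LLM text embedding of the task description (assumption).
text_embedding = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
policy = generate_policy(hyper, text_embedding)   # language used once, here
state = [0.1, -0.2, 0.3, 0.0]
action = policy_act(policy, state)                # no language at control time
```

In this sketch the text embedding is consumed only inside `generate_policy`; every subsequent call to `policy_act` touches just the small generated network, which is the property the abstract relies on for high-frequency control.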