We introduce LLM-in-Sandbox, which enables LLMs to explore within a code sandbox (i.e., a virtual computer) in order to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, generalize to leveraging the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, use the file system to manage long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which trains models for sandbox exploration using only non-agentic data. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization across mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze the efficiency of LLM-in-Sandbox from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.