The prevailing paradigm in AI for physical systems, scaling general-purpose foundation models toward universal multimodal reasoning, confronts a fundamental barrier at the control interface. Recent benchmarks show that even frontier vision-language models achieve only 50-53% accuracy on basic quantitative physics tasks, behaving as approximate guessers that preserve semantic plausibility while violating physical constraints. This input unfaithfulness is not a scaling deficiency but a structural limitation: perception-centric architectures optimize parameter-space imitation, whereas safety-critical control demands outcome-space guarantees over executed actions. Here, we present a fundamentally different pathway toward domain-specific foundation models by introducing compact language models operating as Agentic Physical AI, in which policy optimization is driven by physics-based validation rather than perceptual inference. We train a 360-million-parameter model on synthetic reactor control scenarios, scaling the dataset from 10^3 to 10^5 examples. This induces a sharp phase transition absent in general-purpose models. Small-scale systems exhibit high-variance imitation with catastrophic tail risk, while large-scale models undergo a variance collapse of more than 500x, stabilizing execution-level behavior. Despite balanced exposure to four actuation families, the model autonomously rejects approximately 70% of the training distribution and concentrates 95% of runtime execution on a single-bank strategy. Learned representations transfer across distinct physics regimes and continuous input modalities without architectural modification.
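For concreteness, the sketch below illustrates one way "physics-based validation rather than perceptual inference" can drive policy optimization: candidate actions sampled from the language-model policy are scored by simulating their executed outcome against hard physical constraints, not by similarity to reference actions. This is a minimal sketch under assumed toy dynamics; the plant model, constraint limits, and all names (`PlantState`, `step_plant`, `physics_reward`) are illustrative and not taken from the paper.

```python
# Minimal sketch (not the paper's implementation) of an outcome-space,
# physics-validated reward for policy optimization. All dynamics and
# limits below are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class PlantState:
    power: float         # normalized reactor power (1.0 = rated)
    rod_position: float  # control-rod bank position in [0, 1]


def step_plant(state: PlantState, rod_delta: float) -> PlantState:
    """Toy first-order surrogate: inserting rods reduces power.
    Stands in for whatever simulator the real validator uses."""
    new_pos = min(1.0, max(0.0, state.rod_position + rod_delta))
    # Assumed linear rod-worth response with an arbitrary 0.5 gain.
    new_power = state.power * (1.0 - 0.5 * (new_pos - state.rod_position))
    return PlantState(power=new_power, rod_position=new_pos)


def physics_reward(state: PlantState, rod_delta: float, target_power: float,
                   max_rate: float = 0.05,
                   power_limits: tuple = (0.0, 1.1)) -> float:
    """Outcome-space reward: validate the executed result, so any
    hard-constraint violation dominates the shaped tracking term."""
    if abs(rod_delta) > max_rate:      # actuator rate limit violated
        return -1.0
    nxt = step_plant(state, rod_delta)
    lo, hi = power_limits
    if not lo <= nxt.power <= hi:      # leaves the safety envelope
        return -1.0
    return -abs(nxt.power - target_power)  # closer to target = higher reward


# Score candidate actions sampled from the language-model policy.
state = PlantState(power=1.0, rod_position=0.4)
for action in (0.02, 0.10, -0.01):
    print(f"delta={action:+.2f} reward={physics_reward(state, action, 0.9):.3f}")
```

A rejection-sampling or policy-gradient loop could then optimize against such a signal; the property the abstract emphasizes is that the reward is computed over executed outcomes rather than parameter-space imitation of reference actions.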