DeepInsight: A Unified Evaluation Infrastructure Across the Physical AI Stack

Evaluating a Physical AI stack spans operators that differ by more than three orders of magnitude -- from a single foundation-model decoding step to thousands of physics ticks of whole-body control -- varying orthogonally in modality, reward semantics, and resource profile. No existing framework spans this range, so the stack is evaluated today by stitching together separate harnesses that share neither runtime nor scoring, preserving each segment's local validity but losing the shared identity needed to diagnose cross-layer regressions. We present DeepInsight, an evaluation infrastructure that serves this full spectrum on a single runtime. Rather than homogenize the regimes, it preserves their heterogeneity behind three narrow abstractions -- task, resource, and result -- each realized as one invariant shared by every subsystem: one episode driver, one resource-handle protocol implemented by every expensive backend (LLM inference and sandboxed runtimes alike), and one trace identity scheme under which every event is written. Deployed in production across all three layers of an embodied humanoid stack, this single set of invariants onboards new benchmarks largely by configuration. Where mature peer orchestrators exist -- at the foundation-model end -- it reproduces published references and peer-framework readings within their own spread, runs the same suites faster on a single node, and scales near-linearly across nodes. Its distinctive return is diagnostic: because every layer writes into one shared trace, a regression that begins in one layer and surfaces in another stays localizable on that trace -- a cross-layer payoff no federation of per-segment harnesses can reproduce.

翻译：评估物理人工智能全栈涉及跨越三个数量级以上差异的算子——从单次基础模型解码步骤到数千次全身控制物理滴答——并在模态、奖励语义和资源属性上正交变化。现有框架无法覆盖这一范围，当前全栈评估需通过拼接多个独立测试框架实现，这些框架既不共享运行时也不共享评分机制，虽保留了各局部的有效性，但丧失了诊断跨层回归问题所需的共享标识。我们提出DeepInsight——一种在单一运行时上服务全频谱的评估基础设施。该框架并未强行统一各领域，而是通过三个精简抽象层——任务、资源和结果——保留其异构性，每个抽象均实现为所有子系统共享的不变量：一个回合驱动器、一个由所有昂贵后端（包括LLM推理与沙盒运行时）实现的资源句柄协议，以及一个使所有事件均可被写入的追踪标识方案。该方案在具身人形机器人全栈的三层生产环境中部署后，仅通过配置即可新增基准测试。在基础模型端等成熟的同类编排器领域，它能在其自身误差范围内复现已发表文献与同类框架的读数，在单节点上以更快速度运行相同测试套件，并实现近线性跨节点扩展。其独特价值在于诊断能力：由于每一层都写入同一共享追踪，始于某层而显现在另一层的回归问题可在此追踪中准确定位——这种跨层收益是任何分段测试框架联邦化方案都无法复现的。