High-dimensional stochastic optimal control (SOC) becomes harder with longer planning horizons: existing methods scale linearly in the horizon $T$, and their performance often deteriorates exponentially. We overcome these limitations for a subclass of linearly-solvable SOC problems, namely those whose uncontrolled drift is the gradient of a potential. In this setting, the Hamilton-Jacobi-Bellman equation reduces to a linear PDE governed by an operator $\mathcal{L}$. We prove that, under the gradient drift assumption, $\mathcal{L}$ is unitarily equivalent to a Schrödinger operator $\mathcal{S} = -\Delta + \mathcal{V}$ with purely discrete spectrum, allowing the long-horizon control to be efficiently described via the eigensystem of $\mathcal{L}$. This connection provides two key results. First, for a symmetric linear-quadratic regulator (LQR), $\mathcal{S}$ matches the Hamiltonian of a quantum harmonic oscillator, whose closed-form eigensystem yields an analytic solution to the symmetric LQR with \emph{arbitrary} terminal cost. Second, in a more general setting, we learn the eigensystem of $\mathcal{L}$ using neural networks. We identify implicit reweighting issues in existing eigenfunction learning losses that degrade performance in control tasks, and propose a novel loss function to mitigate them. We evaluate our method on several long-horizon benchmarks, achieving an order-of-magnitude improvement in control accuracy compared to state-of-the-art methods, while reducing memory usage and runtime complexity from $\mathcal{O}(Td)$ to $\mathcal{O}(d)$.
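As a rough illustration of the harmonic-oscillator connection mentioned above, the following sketch discretizes the one-dimensional Schrödinger operator $-\mathrm{d}^2/\mathrm{d}x^2 + x^2$ with finite differences and computes its lowest eigenpairs numerically. It is not the method of this work: the grid size, the truncated domain, the unit-frequency potential $x^2$, and the function name `oscillator_spectrum` are all assumptions made for illustration only.

```python
import numpy as np


def oscillator_spectrum(n_grid=1000, half_width=8.0, n_eigs=5):
    """Lowest eigenpairs of S = -d^2/dx^2 + x^2 on a truncated grid.

    Illustrative assumptions: 1D, unit frequency, Dirichlet boundary
    conditions at +/- half_width, second-order central differences.
    """
    x = np.linspace(-half_width, half_width, n_grid)
    h = x[1] - x[0]

    # Tridiagonal finite-difference matrix for -d^2/dx^2 plus the
    # diagonal potential term V(x) = x^2.
    main_diag = 2.0 / h**2 + x**2
    off_diag = -np.ones(n_grid - 1) / h**2
    S = np.diag(main_diag) + np.diag(off_diag, 1) + np.diag(off_diag, -1)

    # S is real symmetric, so eigh returns real eigenvalues in
    # ascending order with orthonormal eigenvectors as columns.
    eigvals, eigvecs = np.linalg.eigh(S)
    return x, eigvals[:n_eigs], eigvecs[:, :n_eigs]


x, eigvals, eigvecs = oscillator_spectrum()
print(np.round(eigvals, 3))
```

For this operator the exact eigenvalues are $2n + 1$ for $n = 0, 1, 2, \dots$, so the printed values should lie close to 1, 3, 5, 7, 9, and the first eigenvector should match a discretized Gaussian $e^{-x^2/2}$ up to sign and normalization. The discrete, well-separated spectrum is what makes a truncated eigenfunction expansion a plausible description of long-horizon behavior.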