On-the-fly Directed Controller Synthesis (OTF-DCS) mitigates state-space explosion by exploring the system incrementally, and it therefore relies critically on an exploration policy to guide the search efficiently. Recent reinforcement learning (RL) approaches learn such policies and achieve promising zero-shot generalization from small training instances to larger unseen ones. However, a fundamental limitation is anisotropic generalization: due to training stochasticity and trajectory-dependent bias, an RL policy performs strongly only in a specific region of the domain-parameter space while remaining fragile elsewhere. To address this, we propose a Soft Mixture-of-Experts (Soft-MoE) framework that treats these anisotropic behaviors as complementary specializations and combines multiple RL experts via a prior-confidence gating mechanism. Evaluation on the Air Traffic benchmark shows that Soft-MoE substantially expands the solvable parameter space and improves robustness compared to any single expert.
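To make the gating idea concrete, below is a minimal sketch of confidence-gated soft mixing of expert policies, under assumed interfaces: each expert exposes a per-action score function, and a per-expert prior-confidence score reflects how close the target instance's parameters lie to that expert's training region. The names `expert_policies`, `confidences`, and `temperature` are illustrative, not the paper's actual API.

```python
import numpy as np

def gate_weights(confidences, temperature=1.0):
    """Softmax over prior-confidence scores -> mixing weights over experts."""
    z = np.asarray(confidences, dtype=float) / temperature
    z -= z.max()                      # numerical stability
    w = np.exp(z)
    return w / w.sum()

def soft_moe_scores(state, expert_policies, confidences, temperature=1.0):
    """Blend per-expert action scores with confidence-gated weights.

    expert_policies: list of callables state -> array of action scores
    confidences:     per-expert prior-confidence scores (hypothetical)
    """
    w = gate_weights(confidences, temperature)
    scores = [policy(state) for policy in expert_policies]
    return sum(wi * si for wi, si in zip(w, scores))

# Usage sketch: rank candidate transitions to expand during OTF-DCS exploration.
# best_action = int(np.argmax(soft_moe_scores(state, experts, confs)))
```

The mixing is "soft" in that every expert contributes in proportion to its gate weight; a lower temperature sharpens the gate toward the single most confident expert.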