This work studies heterogeneous Multi-Objective Reinforcement Learning (MORL), where objectives can differ sharply in temporal frequency. Such heterogeneity lets dense objectives dominate learning, while sparse long-horizon rewards receive weak credit assignment, leading to poor sample efficiency. We propose Parallel Reward Integration with Symmetry (PRISM), an algorithm that enforces reflectional symmetry as an inductive bias when aligning reward channels. PRISM introduces ReSymNet, a theory-motivated model that reconciles temporal-frequency mismatches across objectives, using residual blocks to learn a scaled opportunity value that accelerates exploration while preserving the optimal policy. We also propose SymReg, a reflectional equivariance regulariser that enforces agent mirroring and constrains policy search to a reflection-equivariant subspace. This restriction provably reduces hypothesis complexity and improves generalisation. Across MuJoCo benchmarks, PRISM consistently outperforms both a sparse-reward baseline and an oracle trained with full dense rewards, improving Pareto coverage and distributional balance: it achieves hypervolume gains exceeding 100\% over the baseline and up to 32\% over the oracle. The code is available at \href{https://github.com/EVIEHub/PRISM}{https://github.com/EVIEHub/PRISM}.
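To make the idea of a reflectional equivariance regulariser concrete, the sketch below computes a generic equivariance penalty: the mean squared gap between "mirror the state, then act" and "act, then mirror the action". It is an illustration under stated assumptions and not the SymReg implementation from the PRISM repository. The function names, the toy policies, and the representation of a reflection as per-coordinate sign flips are all hypothetical choices made for this example.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def reflect(vec: Sequence[float], signs: Sequence[float]) -> Vector:
    """Apply a reflection given as per-coordinate signs (+1 keeps, -1 flips).

    Assumption: the reflection is a diagonal sign matrix. Real robots may
    also need coordinates swapped (e.g. left and right legs).
    """
    return [sign * value for sign, value in zip(signs, vec)]


def equivariance_penalty(
    policy: Callable[[Sequence[float]], Vector],
    states: Sequence[Sequence[float]],
    state_signs: Sequence[float],
    action_signs: Sequence[float],
) -> float:
    """Mean squared gap between policy(reflect(s)) and reflect(policy(s)).

    The penalty is zero exactly when the policy commutes with the
    reflection on every sampled state.
    """
    total = 0.0
    for state in states:
        mirrored_then_act = policy(reflect(state, state_signs))
        act_then_mirrored = reflect(policy(state), action_signs)
        total += sum(
            (a - b) ** 2 for a, b in zip(mirrored_then_act, act_then_mirrored)
        )
    return total / len(states)


# Hypothetical toy setup: 2-D state, 1-D action.
# The reflection flips the second state coordinate and the action.
STATE_SIGNS = [1.0, -1.0]
ACTION_SIGNS = [-1.0]


def equivariant_policy(state: Sequence[float]) -> Vector:
    # Depends only on the flipped coordinate, so it mirrors correctly.
    return [2.0 * state[1]]


def biased_policy(state: Sequence[float]) -> Vector:
    # Depends on the unflipped coordinate, so mirroring is violated.
    return [state[0]]


states = [[1.0, 2.0], [-0.5, 3.0]]
penalty_equivariant = equivariance_penalty(
    equivariant_policy, states, STATE_SIGNS, ACTION_SIGNS
)
penalty_biased = equivariance_penalty(
    biased_policy, states, STATE_SIGNS, ACTION_SIGNS
)
print(penalty_equivariant, penalty_biased)
```

In a training loop, a term of this kind would typically be added to the policy loss with a weighting coefficient, so that gradient updates are pulled towards the reflection-equivariant subspace. How PRISM weights or schedules its regulariser is not specified in the abstract.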