GraphAllocBench: A Flexible Benchmark for Preference-Conditioned Multi-Objective Policy Learning

Preference-Conditioned Policy Learning (PCPL) in Multi-Objective Reinforcement Learning (MORL) aims to approximate diverse Pareto-optimal solutions by conditioning policies on user-specified preferences over objectives. This enables a single model to flexibly adapt to arbitrary trade-offs at run-time by producing a policy on or near the Pareto front. However, existing benchmarks for PCPL are largely restricted to toy tasks and fixed environments, limiting their realism and scalability. To address this gap, we introduce GraphAllocBench, a flexible benchmark built on a novel graph-based resource allocation sandbox environment inspired by city management, which we call CityPlannerEnv. GraphAllocBench provides a rich suite of problems with diverse objective functions, varying preference conditions, and high-dimensional scalability. We also propose two new evaluation metrics -- Proportion of Non-Dominated Solutions (PNDS) and Ordering Score (OS) -- that directly capture preference consistency while complementing the widely used hypervolume metric. Through experiments with Multi-Layer Perceptrons (MLPs) and graph-aware models, we show that GraphAllocBench exposes the limitations of existing MORL approaches and paves the way for using graph-based methods such as Graph Neural Networks in complex, high-dimensional combinatorial allocation tasks. Beyond its predefined problem set, GraphAllocBench enables users to flexibly vary objectives, preferences, and allocation rules, establishing it as a versatile and extensible benchmark for advancing PCPL. Code: https://anonymous.4open.science/r/GraphAllocBench
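
The abstract names Proportion of Non-Dominated Solutions (PNDS) without defining it. As a rough illustration of the underlying idea only, the sketch below computes the fraction of a solution set that is not Pareto-dominated by any other member of the same set, assuming every objective is maximized. The function names `dominates` and `non_dominated_fraction` are hypothetical, and the benchmark's actual PNDS definition may differ (for example, in how preferences enter or which reference set is used).

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` Pareto-dominates `b`, assuming maximization.

    `a` dominates `b` when it is at least as good on every objective
    and strictly better on at least one.
    """
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def non_dominated_fraction(points: List[Sequence[float]]) -> float:
    """Fraction of `points` not dominated by any other point in the set.

    This is an assumed, simplified stand-in for a PNDS-style metric,
    not the definition used by GraphAllocBench.
    """
    if not points:
        return 0.0
    kept = sum(
        1
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    )
    return kept / len(points)


if __name__ == "__main__":
    # Two objectives; (1, 1) is dominated by (2, 2), the other three are not.
    returns = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (1.0, 1.0)]
    print(non_dominated_fraction(returns))
```

In a preference-conditioned setting, each point would be the objective vector obtained by running the policy under one sampled preference, so a value close to 1 suggests that most preference settings yield solutions that no other setting dominates.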
