Safety alignment for large language models relies on preference data, yet current pipelines often train on large, redundant datasets. Existing data selection methods typically score each preference pair independently, collapsing directional preference information into a scalar quality or diversity score. This sample-centric view is especially limiting in multi-dataset settings, where safety directions shared across datasets coexist with dataset-specific residual risks. We propose DOG-DPO, a training-free data selection framework that treats preference pairs as structured geometric signals. DOG-DPO first represents each preference pair as a direction in the model's representation space. It then decomposes the multi-dataset preference geometry into a global anchor subspace and dataset-specific residual subspaces. Finally, it selects a subset that maximizes diversity-based coverage, so that the selected pairs span the alignment directions broadly and without redundancy before DPO training. Across six safety benchmarks and two model backbones, DOG-DPO achieves a strong utility-robustness trade-off using only 11% of the preference pairs. It recovers most of the safety gains of full-data training while remaining entirely teacher-free and training-free, and it runs substantially faster than representative selection baselines.
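The three stages described above (pair directions, anchor/residual decomposition, diversity-driven selection) can be illustrated with a minimal NumPy sketch. The abstract does not specify the concrete operators, so everything here is an assumption made for illustration: directions are taken as normalized differences between chosen and rejected embeddings, subspaces are estimated with a truncated SVD, and selection uses greedy farthest-point sampling as a stand-in for the coverage objective. All function names and parameters (`preference_directions`, `decompose`, `greedy_diverse_subset`, `k_anchor`, `k_residual`, `budget`) are hypothetical and not taken from DOG-DPO itself.

```python
import numpy as np


def preference_directions(chosen, rejected):
    """Assumed form: unit-norm difference of chosen and rejected embeddings."""
    diff = chosen - rejected
    norms = np.linalg.norm(diff, axis=1, keepdims=True)
    return diff / np.clip(norms, 1e-12, None)


def top_subspace(x, k):
    """Rows are an orthonormal basis of the top-k right singular directions of x."""
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return vt[:k]


def decompose(directions_by_dataset, k_anchor, k_residual):
    """Assumed decomposition: a shared anchor basis from the pooled directions,
    then one residual basis per dataset from what the anchor does not explain."""
    pooled = np.vstack(list(directions_by_dataset.values()))
    anchor = top_subspace(pooled, k_anchor)
    residuals = {}
    for name, d in directions_by_dataset.items():
        leftover = d - (d @ anchor.T) @ anchor  # remove the anchor component
        residuals[name] = top_subspace(leftover, k_residual)
    return anchor, residuals


def greedy_diverse_subset(features, budget):
    """Farthest-point greedy selection, a simple stand-in for a
    diversity-based coverage objective (not the paper's criterion)."""
    n = features.shape[0]
    budget = min(budget, n)
    first = int(np.argmax(np.linalg.norm(features, axis=1)))
    selected = [first]
    min_dist = np.linalg.norm(features - features[first], axis=1)
    min_dist[first] = -np.inf  # never pick the same pair twice
    while len(selected) < budget:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist)
        min_dist[nxt] = -np.inf
    return selected


# Toy usage on random embeddings (synthetic data, for shape checking only).
rng = np.random.default_rng(0)
toy = {
    name: preference_directions(rng.normal(size=(50, 16)),
                                rng.normal(size=(50, 16)))
    for name in ("dataset_a", "dataset_b")
}
anchor, residuals = decompose(toy, k_anchor=4, k_residual=2)
for name, d in toy.items():
    basis = np.vstack([anchor, residuals[name]])  # anchor + own residual
    picked = greedy_diverse_subset(d @ basis.T, budget=5)
```

In this sketch each dataset is subsampled in the coordinates of the shared anchor basis plus its own residual basis, so that selection sees both the common and the dataset-specific components. Whether DOG-DPO selects per dataset or jointly is not stated in the abstract.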