Ensuring that Large Language Models (LLMs) align with diverse human preferences while preserving privacy and fairness remains a challenge. Existing methods, such as Reinforcement Learning from Human Feedback (RLHF), rely on centralized data collection, making them computationally expensive and privacy-invasive. We introduce PluralLLM, a federated learning-based approach that enables multiple user groups to collaboratively train a transformer-based preference predictor without sharing sensitive data; the predictor can also serve as a reward model for aligning LLMs. Our method leverages Federated Averaging (FedAvg) to aggregate preference updates efficiently, achieving 46% faster convergence, a 4% improvement in alignment scores, and nearly the same group fairness measure as centralized training. Evaluated on a Q/A preference alignment task, PluralLLM demonstrates that federated preference learning offers a scalable and privacy-preserving alternative for aligning LLMs with diverse human values.
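To make the aggregation step concrete, below is a minimal sketch of FedAvg applied to a preference predictor's weights. It assumes a PyTorch-style `state_dict` per user group; the names `fedavg`, `client_states`, and `client_sizes` are illustrative and not taken from the paper.

```python
import torch

def fedavg(client_states, client_sizes):
    """Weighted average of client model state_dicts (standard FedAvg).

    client_states: list of state_dicts, one per user group's local
                   preference-predictor update (assumed structure).
    client_sizes:  number of local preference examples per group,
                   used as the aggregation weights.
    """
    total = sum(client_sizes)
    global_state = {}
    for key in client_states[0]:
        # Weight each group's parameters by its share of the total data,
        # so larger groups contribute proportionally more to the average.
        global_state[key] = sum(
            (n / total) * state[key].float()
            for state, n in zip(client_states, client_sizes)
        )
    return global_state
```

In a federated round, each group would train locally on its own preference data, send only the updated weights to the server, and receive the averaged model back, so raw preference data never leaves the group.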