In-context learning has shown great potential for aligning Large Language Models (LLMs) with human values, helping reduce harmful outputs and accommodate diverse preferences without costly post-training, an approach known as In-Context Alignment (ICA). However, how LLMs comprehend input prompts remains opaque, which limits ICA's ability to address value tensions: human values are inherently pluralistic and often impose conflicting demands, e.g., stimulation vs. tradition. Current ICA methods therefore face the Instruction Bottleneck challenge, where LLMs struggle to reconcile multiple intended values within a single prompt, leading to incomplete or biased alignment. To address this, we propose PICACO, a novel pluralistic ICA method. Without fine-tuning, PICACO optimizes a meta-instruction that navigates multiple values to better elicit LLMs' understanding of them and improve their alignment. This is achieved by maximizing the total correlation between the specified values and LLM responses, which theoretically reinforces value correlation while reducing distracting noise, yielding effective value instructions. Extensive experiments on five value sets show that PICACO works well with both black-box and open-source LLMs, outperforms several recent strong baselines, and achieves a better balance across up to 8 distinct values.
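For readers unfamiliar with the quantity named above, the following is a minimal sketch of the textbook definition of total correlation for discrete variables (sum of marginal entropies minus joint entropy). It is not PICACO's objective or estimator, which the abstract does not specify; the function names and the toy distributions are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcome -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def total_correlation(joint):
    """Total correlation: sum of marginal entropies minus the joint entropy.

    `joint` maps tuples (x_1, ..., x_n) to probabilities that sum to 1.
    """
    n = len(next(iter(joint)))
    marginals = [defaultdict(float) for _ in range(n)]
    for outcome, p in joint.items():
        for i, x in enumerate(outcome):
            marginals[i][x] += p
    return sum(entropy(m) for m in marginals) - entropy(joint)


# Hypothetical toy variables: two binary "value expressed" flags and one
# binary response flag. These are illustrative assumptions, not paper data.
independent = {
    (a, b, c): 0.125 for a in (0, 1) for b in (0, 1) for c in (0, 1)
}
coupled = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

tc_independent = total_correlation(independent)
tc_coupled = total_correlation(coupled)
```

In this toy setting, mutually independent variables share no information and the total correlation vanishes, while fully coupled variables give a strictly positive value. Maximizing the quantity therefore rewards configurations in which the variables move together.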