In-Context Learning has shown great potential for aligning Large Language Models (LLMs) with human values, helping reduce harmful outputs and accommodate diverse preferences without costly post-training, an approach known as In-Context Alignment (ICA). However, LLMs' comprehension of input prompts remains opaque, limiting ICA's ability to address value tensions: human values are inherently pluralistic and often impose conflicting demands, e.g., stimulation vs. tradition. Current ICA methods therefore face the Instruction Bottleneck challenge, where LLMs struggle to reconcile multiple intended values within a single prompt, leading to incomplete or biased alignment. To address this, we propose PICACO, a novel pluralistic ICA method. Without fine-tuning, PICACO optimizes a meta-instruction that navigates multiple values to better elicit the LLM's understanding of them and improve its alignment. This is achieved by maximizing the total correlation between the specified values and the LLM's responses, which theoretically reinforces value-response correlation while reducing distracting noise, yielding effective value instructions. Extensive experiments on five value sets show that PICACO works well with both black-box and open-source LLMs, outperforms several recent strong baselines, and achieves a better balance across up to 8 distinct values.
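As background, and not the paper's exact formulation: total correlation is the standard information-theoretic measure of multivariate dependence. A minimal sketch, under the illustrative assumption that the specified values are modeled as random variables $V_1,\dots,V_n$ and the LLM response as $Y$:

% Standard definition of total correlation (Watanabe, 1960); treating the
% specified values as V_1..V_n and the response as Y is an illustrative
% assumption, not the paper's precise objective.
\[
\mathrm{TC}(V_1,\dots,V_n,Y)
  \;=\; \sum_{i=1}^{n} H(V_i) + H(Y) - H(V_1,\dots,V_n,Y)
  \;=\; D_{\mathrm{KL}}\!\left( p(v_1,\dots,v_n,y) \,\middle\|\, \textstyle\prod_{i=1}^{n} p(v_i)\,p(y) \right),
\]

where $H(\cdot)$ denotes Shannon entropy. Maximizing this quantity ties the response to all specified values jointly rather than to any single dominant one, which is the intuition behind balancing multiple values at once.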