The rapid advancement of large language models (LLMs) has enabled their transformation into conversational chatbots that grasp contextual nuances and generate relevant responses, closely mirroring human values through advanced techniques such as instruction tuning and reinforcement learning from human feedback (RLHF). However, the techniques used to make LLMs computationally efficient, such as post-training quantization (PTQ), introduce challenges like token-flipping that can impair chatbot performance. In response, we propose a novel preference alignment approach, quantization-aware direct preference optimization (QDPO), which aligns quantized LLMs with their full-precision counterparts to improve conversational abilities. Evaluated on two instruction-tuned LLMs across several languages, QDPO outperformed established PTQ and knowledge-distillation fine-tuning techniques at improving conversational abilities, marking a significant step forward in the development of efficient and effective conversational LLMs.
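To make the core idea concrete, the sketch below shows a DPO-style objective in which the quantized model acts as the policy and its full-precision counterpart serves as the frozen reference. The abstract does not specify how preference pairs are constructed, so the pairing of "chosen" and "rejected" responses, the function names, and the loss formulation here are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch (assumptions noted above): quantized policy + full-precision
# reference plugged into a standard DPO loss. Requires HuggingFace-style models
# whose forward pass returns `.logits`.
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, response_mask):
    """Sum of token log-probabilities over the response positions."""
    logits = model(input_ids).logits[:, :-1, :]      # predictions for next tokens
    targets = input_ids[:, 1:]
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (token_logp * response_mask[:, 1:]).sum(dim=-1)

def qdpo_style_loss(quant_model, fp_model, chosen_ids, rejected_ids,
                    chosen_mask, rejected_mask, beta=0.1):
    """DPO loss with the quantized model as the trainable policy and the
    full-precision model as the frozen reference (assumed pairing)."""
    pi_chosen = sequence_logprob(quant_model, chosen_ids, chosen_mask)
    pi_rejected = sequence_logprob(quant_model, rejected_ids, rejected_mask)
    with torch.no_grad():                             # reference stays frozen
        ref_chosen = sequence_logprob(fp_model, chosen_ids, chosen_mask)
        ref_rejected = sequence_logprob(fp_model, rejected_ids, rejected_mask)
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()
```

Under these assumptions, minimizing the loss pushes the quantized policy's preference margin toward that of the full-precision reference, which is one plausible way to counteract quantization-induced token-flipping.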