Natural Language Processing (NLP) systems increasingly take the form of multi-stage pipelines involving multiple distinct language models (LMs) and prompting strategies. Here we address the question of how to fine-tune such systems to improve their performance. We cast this as a problem of optimizing the underlying LM weights and the prompting strategies together, and consider a challenging but highly realistic scenario in which we have no gold labels for any intermediate stages in the pipeline. To address this challenge, we evaluate approximate optimization strategies in which we bootstrap training labels for all pipeline stages and use these to optimize the pipeline's prompts and fine-tune its weights in alternation. In experiments with multi-hop QA, mathematical reasoning, and feature-based classification, we find that simple approaches for optimizing prompts and weights together outperform optimizing weights alone and prompts alone by up to 65% and 5%, respectively, on average across LMs and tasks. We will release our new optimizers in DSPy at http://dspy.ai.
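The alternating scheme described above can be sketched in plain Python. This is a minimal toy illustration under stated assumptions, not the DSPy API: the two-stage pipeline, the dictionary-backed "LM", and all class and function names here are hypothetical stand-ins. The key idea it demonstrates is that only the final answer carries a gold label; traces whose final answer is correct are bootstrapped into training labels for every stage, and prompt optimization and weight fine-tuning then consume those traces in alternation.

```python
# Hypothetical sketch, NOT the DSPy API: bootstrapped labels drive
# alternating prompt optimization and weight fine-tuning.
from dataclasses import dataclass

@dataclass
class Trace:
    question: str
    intermediate: str          # no gold label exists for this stage
    final_answer: str

class ToyPipeline:
    """Two-stage stand-in pipeline: a query rewriter and a toy 'LM' lookup."""
    def __init__(self):
        self.demos = []                              # prompt side: few-shot demos
        self.kb = {"capital of france": "Paris"}     # weight side: toy LM memory

    def run(self, question):
        hop = question.lower().rstrip("?")           # stage 1: rewrite the query
        return Trace(question, hop, self.kb.get(hop, "unknown"))  # stage 2: answer

    def update_prompts(self, traces):
        self.demos = traces[:4]                      # keep successes as demonstrations

    def finetune_weights(self, traces):
        for t in traces:                             # "fine-tune" = memorize mappings
            self.kb[t.intermediate] = t.final_answer

def bootstrap_traces(pipeline, trainset):
    """Keep full traces whose final answer matches the gold final label;
    their intermediate steps then serve as training labels for every stage."""
    return [t for q, gold in trainset
            if (t := pipeline.run(q)).final_answer == gold]

def optimize(pipeline, trainset, rounds=2):
    for _ in range(rounds):                          # alternate the two steps
        pipeline.update_prompts(bootstrap_traces(pipeline, trainset))
        pipeline.finetune_weights(bootstrap_traces(pipeline, trainset))
    return pipeline
```

In a real system each stage would be an LM call, prompt optimization would select or generate demonstrations, and fine-tuning would update model weights; the control flow of bootstrapping and alternation is the part this sketch preserves.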