We argue that Declarative Self-improving Python (DSPy) optimizers offer a way to align large language model (LLM) prompts and their evaluations with human annotations. We present a comparative analysis of five teleprompter algorithms, namely Cooperative Prompt Optimization (COPRO), Multi-Stage Instruction Prompt Optimization (MIPRO), BootstrapFewShot, BootstrapFewShot with Optuna, and K-Nearest Neighbor Few Shot, within the DSPy framework with respect to their ability to align with human evaluations. As a concrete example, we focus on optimizing the prompt to align hallucination detection (using an LLM as a judge) with human-annotated ground-truth labels on a publicly available benchmark dataset. Our experiments demonstrate that optimized prompts can outperform various benchmark methods for detecting hallucination, and that certain teleprompters outperform others, at least in these experiments.
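The alignment objective described above can be sketched as a simple agreement metric between the LLM judge's hallucination labels and the human-annotated ground truth; a DSPy teleprompter would then search over prompts to maximize this score. The function name, label encoding, and data layout below are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch: an agreement metric that a teleprompter (e.g. DSPy's
# BootstrapFewShot, via its `metric` argument) could maximize when
# compiling a hallucination-detection module. Names are assumptions.

def judge_human_agreement(judge_labels, human_labels):
    """Fraction of examples where the LLM judge's hallucination label
    matches the human-annotated ground-truth label."""
    assert len(judge_labels) == len(human_labels), "label lists must align"
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

# Toy example: 1 = hallucination, 0 = faithful (encoding is an assumption)
judge_labels = [1, 0, 1, 1, 0]
human_labels = [1, 0, 0, 1, 0]
print(judge_human_agreement(judge_labels, human_labels))  # 0.8
```

In DSPy, a per-example version of this check (comparing a single prediction against a single gold label) would be passed as the metric when compiling, so the optimizer's prompt search is scored directly against the human annotations.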