Evaluating outputs of large language models (LLMs) is challenging: it requires generating, and making sense of, many responses. Yet tools that go beyond basic prompting tend to require knowledge of programming APIs, focus on narrow domains, or be closed-source. We present ChainForge, an open-source visual toolkit for prompt engineering and on-demand hypothesis testing of text-generation LLMs. ChainForge provides a graphical interface for comparing responses across models and prompt variations. Our system was designed to support three tasks: model selection, prompt template design, and hypothesis testing (e.g., auditing). We released ChainForge early in its development and iterated on its design with academics and online users. Through in-lab and interview studies, we find that a range of people could use ChainForge to investigate hypotheses that matter to them, including in real-world settings. We identify three modes of prompt engineering and LLM hypothesis testing: opportunistic exploration, limited evaluation, and iterative refinement.