Millions of users prompt large language models (LLMs) for various tasks, but how good are people at prompt engineering? Do users actually get closer to their desired outcome over multiple iterations of their prompts? These questions are crucial when no gold-standard labels are available to measure progress. This paper investigates a scenario in LLM-powered data labeling, "prompting in the dark," where users iteratively prompt LLMs to label data without manually labeled benchmarks. We developed PromptingSheet, a Google Sheets add-on that enables users to compose, revise, and iteratively label data through spreadsheets. In a study with 20 participants, we found that prompting in the dark was highly unreliable: only 9 participants improved their labeling accuracy after four or more iterations. Automated prompt optimization tools such as DSPy also struggled when few gold labels were available. Our findings highlight the importance of gold labels and both the need for, and the risks of, automated support in human prompt engineering, providing insights for future tool design.