Even LLMs that appear safe during evaluation can still produce harmful responses in deployment. Because stochastic sampling yields different responses to the same prompt, low-probability harmful outputs can still reach users at scale. Common human evaluation workflows generate many random samples per prompt and review them in static spreadsheets. The practice scales poorly, forcing evaluators to repeatedly reread near-duplicate prefixes. To address this, we present InFerActive, an interactive system that visualizes sampling results as a navigable tree of readable phrases, allowing evaluators to filter, explore, and extend the generation space on demand. InFerActive utilizes breadth-first sampling, a novel tree construction procedure that matches the harmful-response coverage of random sampling while requiring up to 5.0x fewer samples. Two controlled user studies (N = 12 each) demonstrate that InFerActive significantly improves evaluation efficiency and coverage over both spreadsheet and basic tree baselines.
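As a rough illustration of why a tree view reduces rereading, the sketch below merges several sampled responses into a prefix tree so that shared prefixes are stored only once. It is not InFerActive's implementation and it does not implement breadth-first sampling. The `Node` class, the helper functions, the whitespace tokenization, and the example responses are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One token in the prefix tree; count = number of samples passing through it."""
    count: int = 0
    children: dict[str, "Node"] = field(default_factory=dict)


def build_prefix_tree(samples: list[list[str]]) -> Node:
    """Merge tokenized samples so that identical prefixes share the same nodes."""
    root = Node()
    for tokens in samples:
        root.count += 1
        node = root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())
            node.count += 1
    return root


def count_nodes(node: Node) -> int:
    """Number of token nodes below `node` (the root itself holds no token)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


# Hypothetical sampled responses to one prompt, split on whitespace.
samples = [
    "I cannot help with that".split(),
    "I cannot assist with that".split(),
    "I can explain the risks".split(),
]

tree = build_prefix_tree(samples)
tokens_in_spreadsheet = sum(len(s) for s in samples)  # every row read in full
tokens_in_tree = count_nodes(tree)                    # shared prefixes read once
```

In this toy case the spreadsheet view contains 15 tokens while the tree contains 12, and the gap widens as more samples share long prefixes. The per-node `count` gives an empirical estimate of how often each continuation was sampled, which is the kind of signal an evaluator could use to filter branches.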