This paper proposes a new method for preventing unsafe or otherwise low-quality large language model (LLM) outputs by leveraging the stochasticity of LLMs. We propose a system in which LLM checkers vote on the acceptability of a generated output, regenerating it whenever a disapproval threshold is reached, until sufficiently many checkers approve. We further propose estimators for cost and failure rate and, based on those estimators and experimental data tailored to the application, an algorithm that achieves a desired failure rate at the lowest possible cost. We demonstrate that, under these models, the failure rate decreases exponentially as a function of cost when the voter count and threshold are chosen according to the algorithm, and that the models reasonably estimate the actual performance of such a system in practice, even with limited data.
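The vote-and-regenerate loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate` and the checker functions are hypothetical stand-ins for LLM calls, and the stopping rule (regenerate while disapprovals meet the threshold, up to a round limit) is the only part taken from the text.

```python
import random

def generate_with_voting(generate, checkers, threshold, max_rounds=10):
    """Regenerate until fewer than `threshold` checkers disapprove.

    generate  -- zero-argument callable producing a candidate output
    checkers  -- list of callables returning True (approve) / False (disapprove)
    threshold -- disapproval count at which the output is rejected
    Returns the first accepted output, or None if max_rounds is exhausted.
    """
    for _ in range(max_rounds):
        output = generate()
        disapprovals = sum(1 for check in checkers if not check(output))
        if disapprovals < threshold:
            return output
    return None

# Toy stand-ins for illustration only: a "good" output appears with
# probability 0.5, and each checker wrongly approves a bad output with
# probability 0.1 (simulating checker stochasticity).
random.seed(0)

def toy_generate():
    return "good" if random.random() < 0.5 else "bad"

def toy_checker(output):
    return output == "good" or random.random() < 0.1

result = generate_with_voting(toy_generate, [toy_checker] * 5, threshold=3)
```

With deterministic checkers (approve only "good"), the loop simply resamples until the generator happens to produce an acceptable output, which is the mechanism whose cost and failure rate the paper's estimators model.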