We present evidence of substantial benefit from efficient exploration in gathering human feedback to improve large language models. In our experiments, an agent sequentially generates queries while fitting a reward model to the feedback received. Our best-performing agent generates queries using double Thompson sampling, with uncertainty represented by an epistemic neural network. Our results demonstrate that efficient exploration enables high levels of performance with far fewer queries. Further, both uncertainty estimation and the choice of exploration scheme play critical roles.
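As an illustration of the query-selection step, the following is a minimal sketch of double Thompson sampling. It is not the authors' implementation. It assumes a small ensemble of reward estimates stands in for the epistemic neural network, and it assumes one common formulation of the scheme: the first response is the best one under a sampled hypothesis, and the second is found by resampling hypotheses until a different response wins. The names `double_thompson_sample`, `ensemble_rewards`, and `max_tries` are hypothetical.

```python
import numpy as np


def double_thompson_sample(ensemble_rewards, rng, max_tries=30):
    """Pick two distinct response indices to send for feedback.

    ensemble_rewards: array of shape (n_members, n_responses). Row k holds the
    rewards that ensemble member k assigns to each candidate response. The
    ensemble is an assumed stand-in for an epistemic neural network.
    """
    n_members, n_responses = ensemble_rewards.shape
    if n_responses < 2:
        raise ValueError("need at least two candidate responses")

    # First response: the best one under a randomly sampled ensemble member.
    first = int(np.argmax(ensemble_rewards[rng.integers(n_members)]))

    # Second response: resample members until the best response differs.
    for _ in range(max_tries):
        second = int(np.argmax(ensemble_rewards[rng.integers(n_members)]))
        if second != first:
            return first, second

    # Fallback (an assumption): no sampled member disagreed, so choose a
    # different response uniformly at random.
    others = [i for i in range(n_responses) if i != first]
    return first, int(rng.choice(others))


rng = np.random.default_rng(0)
rewards = rng.normal(size=(10, 5))  # 10 ensemble members, 5 candidate responses
pair = double_thompson_sample(rewards, rng)
```

When the ensemble members disagree about which response is best, the pair tends to contain responses whose relative quality is still uncertain, which is where feedback is most informative for the reward model.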