LLMs are increasingly applied to complex problems that are not resolved in a single response but instead require interacting with an environment to acquire information. In these settings, an LLM must reason about inherent cost-uncertainty tradeoffs when deciding whether to keep exploring or commit to an answer. For instance, on a programming task, an LLM should test a generated code snippet if it is uncertain about that snippet's correctness; writing a test has nonzero cost, but typically a lower one than making a mistake. In this work, we show that we can induce LLMs to explicitly reason about balancing these cost-uncertainty tradeoffs, and thereby explore their environments more effectively. We formalize multiple tasks, including information retrieval and coding, as sequential decision-making problems under uncertainty. Each problem has latent environment state that can be reasoned about via a prior, which is passed to the LLM agent. We introduce a framework called Calibrate-Then-Act (CTA), in which we feed the LLM this additional context to enable it to act closer to optimally. The improvement is preserved even under RL training of both the baseline and CTA. Our results on information-seeking QA and on a simplified coding task show that making cost-benefit tradeoffs explicit with CTA can help agents discover better decision-making strategies.
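The stop-versus-explore tradeoff described above can be illustrated with a minimal sketch. This is not the paper's method; it is a hypothetical one-step decision rule, assuming a calibrated probability of correctness and illustrative cost values chosen here for exposition:

```python
# Hedged sketch of the one-step cost-uncertainty tradeoff: test a candidate
# answer only when the expected cost of committing exceeds the test's cost.
# All parameter names and values are illustrative assumptions.

def should_test(p_correct: float, test_cost: float, mistake_cost: float) -> bool:
    """Return True if paying for a test beats committing immediately.

    Committing now incurs an expected mistake cost of
    (1 - p_correct) * mistake_cost; running a test costs test_cost
    (assuming, for simplicity, the test fully reveals correctness).
    """
    expected_mistake_cost = (1.0 - p_correct) * mistake_cost
    return test_cost < expected_mistake_cost

# An agent calibrated at 60% confidence, with cheap tests, should test:
print(should_test(p_correct=0.6, test_cost=1.0, mistake_cost=10.0))   # True
# A near-certain agent should commit rather than pay for a test:
print(should_test(p_correct=0.99, test_cost=1.0, mistake_cost=10.0))  # False
```

In this toy model, the break-even confidence is 1 - test_cost/mistake_cost; below it, exploring (testing) is worth the cost.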