Large language models (LLMs) could be valuable personal AI agents across various domains, provided they can precisely follow user instructions. However, recent studies have shown significant limitations in LLMs' instruction-following capabilities, raising concerns about their reliability in high-stakes applications. Accurately estimating LLMs' uncertainty in adhering to instructions is critical to mitigating deployment risks. We present, to our knowledge, the first systematic evaluation of the uncertainty estimation abilities of LLMs in the context of instruction following. Our study identifies key challenges with existing instruction-following benchmarks, where multiple factors are entangled with the uncertainty that stems from instruction following, complicating its isolation and the comparison of methods and models. To address these issues, we introduce a controlled evaluation setup with two versions of benchmark data, enabling a comprehensive comparison of uncertainty estimation methods under various conditions. Our findings show that existing uncertainty estimation methods struggle, particularly when models make subtle errors in instruction following. While internal model states provide some improvement, they remain inadequate in more complex scenarios. The insights from our controlled evaluation setup provide a crucial understanding of LLMs' limitations and potential for uncertainty estimation in instruction-following tasks, paving the way for more trustworthy AI agents.