Large language models (LLMs) have demonstrated remarkable capabilities out of the box for a wide range of applications, yet accuracy remains a major area for improvement, especially in mission-critical domains such as biomedicine. An effective method to calibrate the confidence of LLM responses is essential to automatically detect errors and facilitate human-in-the-loop verification. An important source of calibration signals is expert-stipulated programmatic supervision, which is often available at low cost but has its own limitations, such as noise and limited coverage. In this paper, we introduce a Pareto optimal self-supervision framework that can leverage available programmatic supervision to systematically calibrate LLM responses by producing a risk score for every response, without any additional manual effort. This is accomplished by learning a harmonizer model to align LLM output with other available supervision sources; the harmonizer assigns higher risk scores to more uncertain LLM responses and facilitates error correction. Experiments on standard relation extraction tasks in biomedical and general domains demonstrate the promise of this approach, with our proposed risk scores highly correlated with the real error rate of LLMs. For the most uncertain test instances, dynamic prompting based on our proposed risk scores yields significant accuracy improvements for off-the-shelf LLMs, boosting GPT-3 results past state-of-the-art (SOTA) weak supervision and GPT-4 results past SOTA supervised results on challenging evaluation datasets.
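As a rough illustration of how a per-response risk score could drive re-prompting, the sketch below assumes the harmonizer outputs a probability distribution over labels for each input, and takes the risk of an LLM response to be the probability mass the harmonizer places on labels other than the LLM's answer. That definition, the function names `risk_scores` and `most_uncertain`, the toy probabilities, and the 50% cutoff are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from typing import List, Sequence


def risk_scores(
    harmonizer_probs: Sequence[Sequence[float]],
    llm_labels: Sequence[int],
) -> List[float]:
    """Risk of each LLM response (assumed definition): the probability
    the harmonizer assigns to any label other than the LLM's answer."""
    return [1.0 - probs[label] for probs, label in zip(harmonizer_probs, llm_labels)]


def most_uncertain(scores: Sequence[float], fraction: float) -> List[int]:
    """Indices of the highest-risk responses, highest risk first.
    `fraction` is a hypothetical cutoff, not a value from the paper."""
    k = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy example: four inputs, two candidate labels (made-up numbers).
harmonizer_probs = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.55, 0.45]]
llm_labels = [0, 0, 1, 1]  # label index chosen by the LLM for each input

scores = risk_scores(harmonizer_probs, llm_labels)
flagged = most_uncertain(scores, fraction=0.5)
print(flagged)  # indices of responses to re-prompt or send for human review
```

In this toy setup, the flagged responses are the ones where the harmonizer disagrees most strongly with the LLM; these would be the candidates for dynamic prompting or human-in-the-loop verification.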