Generative large language models (LLMs) have demonstrated remarkable capabilities across a wide range of applications, but reducing ungrounded or erroneous responses remains a major area for improvement. Unlike task-specific models, LLMs lack an effective method for calibrating the confidence level of their responses to indicate potential errors and facilitate human-in-the-loop verification. An important source of calibration signal is expert-stipulated programmatic supervision, which is often available at low cost but has its own limitations, such as noise and limited coverage. In this paper, we introduce a Pareto optimal self-supervision framework that leverages available programmatic supervision to systematically calibrate LLM responses by producing a risk score for every response, without any additional manual effort. This is accomplished by learning a harmonizer model that aligns with the LLM output as well as the other weak supervision sources. The model assigns higher risk scores to more uncertain LLM responses and thereby facilitates error correction. Experiments on standard relation extraction and classification tasks in biomedical and general domains demonstrate that the proposed risk score is highly correlated with the actual LLM error rate. Using a dynamic prompting strategy based on the risk score, we observed significant accuracy improvements for off-the-shelf LLMs, boosting GPT-3.5 results past the state-of-the-art (SOTA) weak supervision model and GPT-4 results past SOTA supervised results on challenging evaluation datasets.
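To make the risk-score idea concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the harmonizer returns a probability distribution over class labels and that the risk of an LLM response is the probability mass the harmonizer places on labels other than the one the LLM chose. The names `risk_score`, `answer_with_dynamic_prompt`, `reprompt`, and the `threshold` value are all hypothetical and introduced only for illustration.

```python
from typing import Callable, Sequence, Tuple


def risk_score(harmonizer_probs: Sequence[float], llm_label: int) -> float:
    """Assumed definition: probability the harmonizer assigns to any label
    other than the LLM's answer. Higher means the response is riskier."""
    return 1.0 - harmonizer_probs[llm_label]


def answer_with_dynamic_prompt(
    x: str,
    llm: Callable[[str], int],
    harmonizer: Callable[[str], Sequence[float]],
    reprompt: Callable[[str, float], int],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> Tuple[int, float]:
    """Query the LLM once, score the response, and re-prompt only if risky.

    Returns the final label and the risk of the *initial* response.
    """
    label = llm(x)
    risk = risk_score(harmonizer(x), label)
    if risk > threshold:
        # High risk: fall back to a different (e.g. more detailed) prompt.
        label = reprompt(x, risk)
    return label, risk
```

In this sketch only high-risk inputs incur a second LLM call, which is one plausible way a risk score could drive a dynamic prompting strategy; the actual strategy and risk definition may differ.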