As LLMs are increasingly integrated into human-in-the-loop content moderation systems, a central challenge is deciding when their outputs can be trusted and when escalation to human review is preferable. We propose a framework for supervised LLM uncertainty quantification that learns a dedicated meta-model over LLM Performance Predictors (LPPs), features derived from LLM outputs such as log-probabilities, entropy, and novel uncertainty attribution indicators. We demonstrate that our method enables cost-aware selective classification in real-world human-AI workflows: escalating high-risk cases while automating the rest. Experiments across state-of-the-art LLMs, both proprietary (Gemini, GPT) and open-source (Llama, Qwen), on multimodal and multilingual moderation tasks show significant improvements over existing uncertainty estimators in accuracy-cost trade-offs. Beyond uncertainty estimation, the LPPs enhance explainability by providing new insights into failure conditions (e.g., ambiguous content vs. under-specified policy). This work establishes a principled framework for uncertainty-aware, scalable, and responsible human-AI moderation workflows.
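To make the setup concrete, the sketch below illustrates the general idea of a supervised meta-model over LLM-derived confidence features used for cost-aware escalation. It is not the authors' implementation: the feature set (mean token log-probability, predictive entropy), the logistic-regression meta-model, the synthetic data, and the expected-cost escalation rule are all illustrative assumptions.

```python
# Minimal sketch, assuming: LPP features = mean token log-probability and
# predictive entropy; meta-model = logistic regression predicting whether the
# LLM's moderation label is correct; escalation via a simple expected-cost
# rule. Feature names, costs, and thresholds are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for LLM Performance Predictors (LPPs):
# column 0 = mean token log-probability, column 1 = entropy of the label
# distribution. Lower confidence / higher entropy -> more frequent errors.
n = 2000
mean_logprob = rng.normal(-0.5, 0.4, n)
entropy = rng.gamma(2.0, 0.3, n)
X = np.column_stack([mean_logprob, entropy])
p_err = 1 / (1 + np.exp(-(-2.0 - 3.0 * mean_logprob + 1.5 * entropy)))
llm_correct = (rng.random(n) > p_err).astype(int)  # 1 = LLM label was right

# Meta-model: predict P(LLM label is correct) from the LPP features.
meta = LogisticRegression().fit(X[:1500], llm_correct[:1500])
p_correct = meta.predict_proba(X[1500:])[:, 1]

# Cost-aware selective classification: escalate to human review when the
# expected cost of auto-accepting (P(error) * cost of a moderation error)
# exceeds the cost of a human review. Costs are illustrative.
COST_ERROR, COST_REVIEW = 10.0, 1.0
escalate = (1 - p_correct) * COST_ERROR > COST_REVIEW
print(f"escalated {escalate.mean():.1%} of held-out cases")
```

In this toy version, raising COST_REVIEW (or lowering COST_ERROR) shifts more traffic to automation; the accuracy-cost trade-off reported in the abstract corresponds to sweeping such a threshold over the meta-model's scores.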