We illustrate how a calibrated model can help balance common trade-offs in task-oriented parsing. In a simulated annotator-in-the-loop experiment, we show that well-calibrated confidence scores allow us to balance cost with annotator load, improving accuracy with a small number of interactions. We then examine how confidence scores can help optimize the trade-off between usability and safety. We show that confidence-based thresholding can substantially reduce the number of incorrect low-confidence programs that are executed; however, this comes at a cost to usability. We propose the DidYouMean system, which better balances usability and safety.
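To make the usability and safety trade-off concrete, the following is a minimal sketch of confidence-based thresholding, not the implementation behind the results above. The `Prediction` type, the `threshold_tradeoff` helper, and the toy confidence values are all hypothetical and chosen only for illustration. The sketch assumes each predicted program carries a single scalar confidence and that a program is executed only when its confidence meets the threshold.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    confidence: float  # model confidence in the predicted program, in [0, 1]
    correct: bool      # whether the predicted program matches the gold program


def threshold_tradeoff(preds, threshold):
    """Summarize what happens when only programs at or above `threshold` run."""
    executed = [p for p in preds if p.confidence >= threshold]
    rejected = [p for p in preds if p.confidence < threshold]
    return {
        # Safety: incorrect programs that still get executed.
        "executed_incorrect": sum(not p.correct for p in executed),
        # Usability: correct programs the user is denied.
        "rejected_correct": sum(p.correct for p in rejected),
        # Fraction of all predictions that are executed.
        "coverage": len(executed) / len(preds) if preds else 0.0,
    }


# Toy data, invented for illustration only (not experimental results).
preds = [
    Prediction(0.95, True),
    Prediction(0.80, True),
    Prediction(0.55, False),
    Prediction(0.40, True),
    Prediction(0.30, False),
]

no_threshold = threshold_tradeoff(preds, 0.0)
with_threshold = threshold_tradeoff(preds, 0.6)
print(no_threshold)
print(with_threshold)
```

On this toy data, raising the threshold from 0.0 to 0.6 removes both incorrect executions but also rejects one correct program, which is the usability cost described above.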