To establish the trustworthiness of systems that automatically generate text captions for audio, images, and video, existing reference-free metrics rely on large pretrained models, which are impractical to deploy in resource-constrained settings. To address this, we propose a set of metrics that elicit the model's confidence in its own generations. To assess how well these confidence metrics can substitute for correctness measures that rely on reference captions, we evaluate their calibration against those correctness measures. We discuss why some of these confidence metrics align better with certain correctness measures, and we provide insight into why temperature scaling of confidence metrics is effective. Our main contribution is a suite of well-calibrated, lightweight confidence metrics for reference-free evaluation of captions in resource-constrained settings.