Generative models, such as large language models, are becoming increasingly relevant in our daily lives, yet no theoretical framework exists to assess their generalization behavior and uncertainty. In particular, uncertainty estimation is commonly handled in an ad-hoc, task-dependent manner; for example, approaches designed for natural language cannot be transferred to image generation. In this paper, we introduce the first bias-variance-covariance decomposition for kernel scores and their associated entropy. We propose unbiased and consistent estimators for each quantity, which require only generated samples and not the underlying model itself. As an application, we evaluate the generalization of diffusion models and find that mode collapse of minority groups is a phenomenon contrary to overfitting. Further, we demonstrate that variance and predictive kernel entropy are viable measures of uncertainty for image, audio, and language generation. Specifically, our approach to uncertainty estimation is more predictive of performance on the CoQA and TriviaQA question answering datasets than existing baselines and can also be applied to closed-source models.
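To illustrate how a quantity such as predictive kernel entropy can be estimated from generated samples alone, the following is a minimal sketch and not the authors' implementation. It assumes the kernel entropy is taken to be -E[k(X, X')] for independent samples X, X' from the model (one common convention; the exact definition and sign used in the paper may differ), that each generated sample has already been mapped to a fixed-length embedding vector by some hypothetical encoder, and that an RBF kernel with a hand-picked bandwidth is acceptable. The function names `rbf_kernel_matrix` and `kernel_entropy_estimate` are illustrative. The estimator is the standard U-statistic, which averages the kernel over all pairs of distinct samples and is therefore unbiased for E[k(X, X')].

```python
import numpy as np


def rbf_kernel_matrix(x, bandwidth=1.0):
    """Gram matrix of the RBF kernel k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2)).

    `x` has shape (n_samples, n_features); each row is assumed to be an
    embedding of one generated sample (the embedding model is hypothetical).
    """
    x = np.asarray(x, dtype=float)
    sq_norms = np.sum(x ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def kernel_entropy_estimate(samples, bandwidth=1.0):
    """U-statistic estimate of -E[k(X, X')] from i.i.d. generated samples.

    Assumption: the kernel entropy is defined as the negative expected kernel
    similarity between two independent generations. Only the samples are
    needed, not the generative model itself.
    """
    gram = rbf_kernel_matrix(samples, bandwidth)
    n = gram.shape[0]
    if n < 2:
        raise ValueError("at least two samples are required")
    # Sum over all pairs i != j: dropping the diagonal keeps the estimate unbiased.
    off_diagonal_sum = gram.sum() - np.trace(gram)
    return -off_diagonal_sum / (n * (n - 1))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    confident = rng.normal(scale=0.05, size=(10, 8))  # near-identical generations
    uncertain = rng.normal(scale=2.0, size=(10, 8))   # widely scattered generations
    print("low-spread samples: ", kernel_entropy_estimate(confident))
    print("high-spread samples:", kernel_entropy_estimate(uncertain))
```

Under these assumptions the estimate lies between -1 and 0 for the RBF kernel: generations that are nearly identical give values close to -1 (low uncertainty), while generations that are far apart in embedding space give values close to 0 (high uncertainty). Because only samples enter the computation, the same sketch would apply to a closed-source model queried through an API.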