Uncertainty in machine learning models is a timely and vast field of research. In supervised learning, uncertainty can already arise in the first stage of the training process, the annotation phase. This is particularly evident when some instances cannot be definitively classified: there is inevitable ambiguity in the annotation step and hence not necessarily a "ground truth" associated with each instance. The main idea of this work is to drop the assumption of a ground-truth label and instead embed the annotations into a multidimensional space. This embedding is derived from the empirical distribution of annotations in a Bayesian setup, modeled via a Dirichlet-Multinomial framework. We estimate the model parameters and posteriors using a stochastic Expectation Maximization algorithm with Markov Chain Monte Carlo steps. The methods developed in this paper readily extend to various situations in which multiple annotators independently label instances. To showcase its generality, we apply the proposed approach to three benchmark datasets for image classification and Natural Language Inference. Beyond the embeddings, we investigate the resulting correlation matrices, which closely reflect the semantic similarities of the original classes for all three exemplary datasets.
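To make the Dirichlet-Multinomial setup concrete, the following is a minimal sketch of its conjugate posterior update for per-instance annotation counts. The counts and the symmetric prior are illustrative assumptions; the paper's full model additionally learns parameters via stochastic EM with MCMC steps, which this sketch does not implement.

```python
import numpy as np

# Hypothetical annotation counts: 3 instances, 4 classes, 10 annotators each.
# (Illustrative data, not from the paper.)
counts = np.array([
    [7, 2, 1, 0],
    [3, 3, 3, 1],
    [0, 1, 4, 5],
])

# Symmetric Dirichlet prior over class probabilities (assumed alpha = 1).
alpha = np.ones(4)

# Dirichlet-Multinomial conjugacy: the posterior over each instance's
# class-probability vector is Dirichlet(alpha + counts).
posterior_params = alpha + counts

# The posterior mean yields a soft label distribution per instance,
# in the spirit of replacing a single ground-truth label.
soft_labels = posterior_params / posterior_params.sum(axis=1, keepdims=True)
```

Each row of `soft_labels` is a probability vector whose mass spreads across classes in proportion to annotator disagreement, rather than collapsing to a single ground-truth class.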