Data embeddings from CLIP and ImageBind provide powerful features for analyzing multimedia and multimodal data. We assess their performance for classification using a Gaussian Mixture Model (GMM) based layer as an alternative to the standard Softmax layer. GMM-based classifiers have recently been shown to perform competitively as part of deep learning pipelines trained end-to-end. Our first contribution is to investigate GMM-based classification performance on the CLIP and ImageBind embedding spaces. Our second contribution is a GMM-based classifier with a lower parameter count than previously proposed ones. We find that, in most cases, a single Gaussian component per class is enough to capture each class in these embedding spaces, and we hypothesize that this is because the contrastive loss used to train these embeddings naturally concentrates the features of each class. We also observe that ImageBind often outperforms CLIP for image-dataset classification, even when the embedding spaces are compressed with PCA.
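The classification scheme described above can be illustrated with a minimal sketch: fit one single-component, diagonal-covariance GMM per class on (synthetic stand-ins for) embedding vectors, then assign each input to the class with the highest log-likelihood. This is an illustrative assumption of the general approach using scikit-learn, not the authors' exact architecture, which is trained end-to-end.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for CLIP/ImageBind embeddings: two classes,
# each concentrated around its own direction (as a contrastive
# loss tends to produce).
d = 32
centers = rng.normal(size=(2, d))
X = np.vstack([c + 0.1 * rng.normal(size=(200, d)) for c in centers])
y = np.repeat([0, 1], 200)

# One Gaussian component per class; diagonal covariance keeps the
# parameter count low (one mean and one variance per dimension).
gmms = [
    GaussianMixture(n_components=1, covariance_type="diag",
                    random_state=0).fit(X[y == k])
    for k in (0, 1)
]

def predict(X):
    # Classify by the highest per-class log-likelihood
    # (uniform class priors assumed).
    scores = np.stack([g.score_samples(X) for g in gmms], axis=1)
    return scores.argmax(axis=1)

accuracy = (predict(X) == y).mean()
```

A PCA step (e.g. `sklearn.decomposition.PCA`) could be applied to `X` before fitting to mimic the compressed-embedding setting mentioned in the abstract.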