Analyzing and visualizing scientific ensemble datasets is challenging because of their high dimensionality and complexity. Dimensionality reduction techniques and autoencoders are powerful tools for feature extraction, but they often struggle with such high-dimensional data. This paper presents an enhanced autoencoder framework that combines a clustering loss, based on the soft silhouette score, with a contrastive loss to improve the visualization and interpretability of ensemble datasets. First, EfficientNetV2 is used to generate pseudo-labels for the unlabeled portions of the datasets. The reconstruction, clustering, and contrastive objectives are then optimized jointly, which encourages similar data points to group together and distinct clusters to separate in the latent space. UMAP is subsequently applied to the latent representation to produce 2D projections, which are evaluated using the silhouette score. Several types of autoencoders are compared on their ability to extract meaningful features. Experiments on two scientific ensemble datasets (channel structures in soil derived from Markov chain Monte Carlo, and droplet-on-film impact dynamics) show that models incorporating a clustering or contrastive loss marginally outperform the baseline approaches.
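The evaluation step described above, scoring a 2D projection with the silhouette score, can be illustrated with a minimal sketch. It computes the standard (hard-assignment) silhouette score with Euclidean distances in plain Python; it is not the soft silhouette loss used during training, and it is not taken from the paper's implementation. The function name, the four example points, and the pseudo-labels are hypothetical and stand in for a real UMAP projection and its labels.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette score needs at least two clusters")

    def mean_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Common convention: a point alone in its cluster contributes 0.
            continue
        a = mean_dist(i, own)  # mean distance to the point's own cluster
        b = min(               # mean distance to the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


# Hypothetical 2D projection: two tight, well-separated groups.
projection = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
pseudo_labels = [0, 0, 1, 1]
score = silhouette_score(projection, pseudo_labels)
print(f"silhouette score: {score:.3f}")
```

Scores close to 1 indicate compact, well-separated clusters in the projection, while scores near 0 or below indicate overlapping or mislabeled groups. In this toy example, swapping the labels so that each cluster spans both groups yields a negative score.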