JEMA: A Joint Embedding Framework for Scalable Co-Learning with Multimodal Alignment

This work introduces JEMA (Joint Embedding with Multimodal Alignment), a novel co-learning framework tailored for laser metal deposition (LMD), a pivotal process in metal additive manufacturing. As Industry 5.0 gains traction, efficient process monitoring becomes increasingly crucial. However, limited data and the opaque nature of AI make it difficult to apply in an industrial setting. JEMA addresses these challenges by leveraging multimodal data, including multi-view images and metadata such as process parameters, to learn transferable semantic representations. By applying a supervised contrastive loss function, JEMA enables robust learning and subsequent process monitoring using only the primary modality, which simplifies hardware requirements and reduces computational overhead. We investigate the effectiveness of JEMA in LMD process monitoring, focusing on its generalization to downstream tasks such as melt pool geometry prediction without extensive fine-tuning. Our empirical evaluation demonstrates the high scalability and performance of JEMA, particularly when combined with Vision Transformer models. Compared to supervised contrastive learning, we report an 8% increase in performance in multimodal settings and a 1% improvement in unimodal settings. Additionally, the learned embedding representation enables the prediction of metadata, which enhances interpretability and makes it possible to assess the contribution of the added metadata. Our framework lays the foundation for integrating multisensor data with metadata, enabling diverse downstream tasks within the LMD domain and beyond.
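
To make the supervised contrastive baseline mentioned in the abstract concrete, the following is a minimal sketch of the generic supervised contrastive loss in its commonly published form. It is not JEMA's own objective, which the abstract does not specify. The function names, the temperature value, and the toy 2-D embeddings are all illustrative assumptions, and the labels simply stand in for whatever supervisory signal groups the samples.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (illustrative, not JEMA's exact loss).

    For each anchor i, samples sharing its label are positives; every other
    sample in the batch appears in the softmax denominator. Anchors without
    any positive are skipped.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    per_anchor_losses = []
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled similarities between the anchor and all other samples.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Log-sum-exp with max subtraction for numerical stability.
        max_logit = max(logits.values())
        log_denominator = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values())
        )
        # Mean negative log-probability of the positives for this anchor.
        loss_i = -sum(logits[p] - log_denominator for p in positives) / len(positives)
        per_anchor_losses.append(loss_i)
    if not per_anchor_losses:
        return 0.0
    return sum(per_anchor_losses) / len(per_anchor_losses)


# Toy batch: two tight clusters in 2-D (hypothetical values).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned_loss = supervised_contrastive_loss(toy_embeddings, [0, 0, 1, 1])
misaligned_loss = supervised_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
```

In this toy batch, labels that agree with the geometric clusters yield a lower loss than labels that cut across them, which is the behavior a contrastive objective rewards when it pulls same-label embeddings together.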
