Audio-visual cross-modality knowledge transfer for machine learning-based in-situ monitoring in laser additive manufacturing

Various machine learning (ML)-based in-situ monitoring systems have been developed to detect process anomalies and defects in laser additive manufacturing (LAM). Multimodal fusion can improve in-situ monitoring performance by acquiring and integrating data from multiple modalities, such as visual and audio data. However, multimodal fusion requires multiple sensors of different types, which raises hardware, computational, and operational costs. This paper proposes a cross-modality knowledge transfer (CMKT) methodology that transfers knowledge from a source modality to a target modality for LAM in-situ monitoring. During the training phase, CMKT enhances the usefulness of the features extracted from the target modality; during the prediction phase, the sensors of the source modality can be removed. Three CMKT methods are proposed: semantic alignment, fully supervised mapping, and semi-supervised mapping. Semantic alignment establishes a shared encoded space between modalities to facilitate knowledge transfer. It uses a semantic alignment loss to align the distributions of the same classes across modalities (e.g., visual defective and audio defective classes) and a separation loss to separate the distributions of different classes (e.g., visual defective and audio defect-free classes). The two mapping methods transfer knowledge by deriving the features of one modality from those of the other modality, using fully supervised and semi-supervised learning, respectively. The proposed CMKT methods were implemented and compared with multimodal audio-visual fusion in an LAM in-situ anomaly detection case study. The semantic alignment method achieved an accuracy of 98.4% with the audio modality removed during the prediction phase, which is comparable to the accuracy of multimodal fusion (98.2%).
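The abstract does not give the form of the semantic alignment and separation losses, so the following is only a minimal sketch of one plausible centroid-based formulation, not the authors' implementation. The function names `class_centroids` and `cmkt_losses`, the use of class centroids, the squared Euclidean distance, and the hinge `margin` are all assumptions made for illustration. The sketch also assumes at least two classes, each present in both modalities.

```python
import numpy as np


def class_centroids(z, y, classes):
    """Mean embedding per class; rows follow the order of `classes`."""
    return np.stack([z[y == c].mean(axis=0) for c in classes])


def cmkt_losses(z_visual, y_visual, z_audio, y_audio, margin=1.0):
    """Hypothetical alignment and separation losses in a shared encoded space.

    z_visual, z_audio: (n_samples, dim) embeddings from each modality's encoder.
    y_visual, y_audio: integer class labels (e.g., 0 = defect-free, 1 = defective).
    margin: assumed hinge margin for the separation term.
    """
    classes = np.unique(np.concatenate([y_visual, y_audio]))
    mu_v = class_centroids(z_visual, y_visual, classes)
    mu_a = class_centroids(z_audio, y_audio, classes)

    # d[i, j] = distance between visual centroid of class i
    # and audio centroid of class j.
    d = np.linalg.norm(mu_v[:, None, :] - mu_a[None, :, :], axis=-1)
    same = np.eye(len(classes), dtype=bool)

    # Alignment: pull same-class centroids of the two modalities together.
    align = float(np.mean(d[same] ** 2))
    # Separation: push different-class centroids at least `margin` apart.
    separate = float(np.mean(np.maximum(0.0, margin - d[~same]) ** 2))
    return align, separate


# Toy usage: audio embeddings are the visual ones shifted by 3 along axis 0.
y = np.array([0, 0, 1, 1])
z_v = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
z_a = z_v + np.array([3.0, 0.0])
align_loss, separation_loss = cmkt_losses(z_v, y, z_a, y, margin=2.0)
```

In a training loop, terms of this kind would typically be added to the classification loss of the target-modality encoder, so that only that encoder is needed at prediction time; how the terms are weighted is not stated in the abstract.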
