The increasing complexity of Industry 4.0 systems brings new challenges for predictive maintenance tasks such as fault detection and diagnosis. A realistic setting for these tasks involves multi-source data streams from different modalities, such as sensor measurement time series, machine images, and textual maintenance reports. These heterogeneous multimodal streams also differ in their acquisition frequency, may embed temporally unaligned information, and can be arbitrarily long, depending on the system and task considered. Whereas multimodal fusion has been widely studied in a static setting, to the best of our knowledge no previous work considers arbitrarily long multimodal streams together with related tasks such as prediction across time. In this paper, we therefore first formalize heterogeneous multimodal learning in a streaming setting as a new paradigm. To tackle this challenge, we propose StreaMulT, a Streaming Multimodal Transformer that relies on cross-modal attention and a memory bank to process arbitrarily long input sequences at training time and to run in a streaming way at inference. StreaMulT improves the state-of-the-art metrics on the CMU-MOSEI dataset for the Multimodal Sentiment Analysis task, while handling much longer inputs than other multimodal models. Finally, the experiments highlight the importance of the textual embedding layer, calling into question recent improvements on Multimodal Sentiment Analysis benchmarks.
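The abstract names two ingredients, cross-modal attention and a memory bank, that let a model consume a long stream segment by segment. The following is a minimal illustrative sketch of that general idea, not the StreaMulT implementation: learned projections are omitted, the memory bank stores a simple mean summary of each past source segment, and every name (`cross_modal_attention`, `mean_vector`, `stream`, `memory_size`) is hypothetical.

```python
import math
from collections import deque


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, context):
    """Scaled dot-product attention where each query vector (target modality)
    attends over the context vectors (source modality plus memory).
    Assumption: no learned projections, so keys and values are the context itself."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, context)) for i in range(dim)]
        )
    return outputs


def mean_vector(vectors):
    """Average a list of equal-length vectors into one summary vector."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def stream(target_segments, source_segments, memory_size=4):
    """Process paired segments one at a time with a bounded memory bank.
    Assumption: each past source segment is summarized by its mean vector."""
    memory = deque(maxlen=memory_size)  # oldest summaries are evicted first
    for target_seg, source_seg in zip(target_segments, source_segments):
        # The current segment sees past summaries plus its own source vectors.
        context = list(memory) + source_seg
        yield cross_modal_attention(target_seg, context)
        memory.append(mean_vector(source_seg))
```

Because the memory bank is bounded, the cost of each step depends on the segment length and `memory_size` rather than on the total stream length. The source segments may also have a different length from the target segments, which loosely mirrors modalities sampled at different frequencies.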