Cross-modal representation learning learns a shared embedding between two or more modalities so that a given task performs better than it would with any single modality alone. Learning such a representation from different data types, such as images and time-series data (e.g., audio or text data), requires a deep metric learning loss that minimizes the distance between the modality embeddings. In this paper, we propose to use the contrastive or triplet loss, which uses positive and negative identities to create sample pairs with different labels, for cross-modal representation learning between image and time-series modalities (CMR-IS). Adapting the triplet loss to cross-modal representation learning lets the main task (time-series classification) reach higher accuracy by exploiting additional information from the auxiliary task (image classification). We present a triplet loss with a dynamic margin for single-label and sequence-to-sequence classification tasks. We perform extensive evaluations on synthetic image and time-series data, on offline handwriting recognition (HWR) data, and on online HWR data from sensor-enhanced pens for classifying written words. Our experiments show improved classification accuracy, faster convergence, and better generalizability due to an improved cross-modal representation. Furthermore, the better generalizability leads to better adaptability between writers for online HWR.
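To make the idea of a cross-modal triplet loss with a dynamic margin concrete, the following is a minimal sketch and not the formulation evaluated in the paper. It assumes the anchor is a time-series embedding and the positive and negative are image embeddings. The margin rule in `dynamic_margin` (a base margin scaled by the normalized edit distance between word labels) is a hypothetical choice made for illustration, and all function names and the `base_margin` parameter are likewise assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def edit_distance(s, t):
    """Levenshtein distance between two label strings (row-by-row DP)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def dynamic_margin(anchor_label, negative_label, base_margin=1.0):
    """Hypothetical margin rule (an assumption, not taken from the paper):
    scale a base margin by the normalized edit distance of the two labels,
    so that more dissimilar words are pushed further apart."""
    longest = max(len(anchor_label), len(negative_label), 1)
    return base_margin * edit_distance(anchor_label, negative_label) / longest


def cross_modal_triplet_loss(ts_anchor, img_positive, img_negative, margin):
    """Triplet loss max(0, d(a, p) - d(a, n) + margin), with the anchor taken
    from the time-series modality and the positive/negative from the image
    modality."""
    d_pos = euclidean(ts_anchor, img_positive)
    d_neg = euclidean(ts_anchor, img_negative)
    return max(0.0, d_pos - d_neg + margin)


# Usage with toy 2-D embeddings and hypothetical word labels.
margin = dynamic_margin("cat", "dog")
loss = cross_modal_triplet_loss([0.0, 0.0], [0.3, 0.4], [0.6, 0.8], margin)
```

With a fixed margin, every negative is treated as equally distant from the anchor; a label-dependent margin of this kind is one way to let the sequence-to-sequence setting separate very different words more strongly than near-identical ones.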