Soft dynamic time warping (SDTW) is a differentiable loss function that enables training neural networks from weakly aligned data. Typically, SDTW is used to iteratively compute and refine soft alignments that compensate for temporal deviations between the training data and its weakly annotated targets. A major problem is that, in the early stage of training, mismatches between the estimated soft alignments and the reference alignments lead to incorrect parameter updates, which makes the overall training procedure unstable. In this paper, we investigate these stability issues using pitch class estimation from music recordings as an illustrative case study. In particular, we introduce and discuss three conceptually different strategies (hyperparameter scheduling, a diagonal prior, and sequence unfolding) that aim to stabilize the intermediate soft alignment results. Finally, we report on experiments that demonstrate the effectiveness of these strategies and discuss efficiency and implementation issues.
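To make the notion of a differentiable alignment loss concrete, the following is a minimal NumPy sketch of the standard soft-DTW forward recursion, in which the hard minimum of classical DTW is replaced by a smooth minimum with temperature `gamma`. It is an illustration under stated assumptions, not the implementation used in this work: the function names `softmin` and `soft_dtw`, the squared Euclidean local cost, and the step set (horizontal, vertical, diagonal) are choices made here for the example.

```python
import numpy as np


def softmin(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    v = -np.asarray(values, dtype=float) / gamma
    m = v.max()
    return -gamma * (m + np.log(np.exp(v - m).sum()))


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW cost between sequences x of shape (N, d) and y of shape (M, d).

    Assumption for this sketch: squared Euclidean local cost and the
    three standard DTW steps (vertical, horizontal, diagonal).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    # Pairwise squared Euclidean local costs, shape (N, M).
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    # Accumulated cost with an infinite border so that all paths start at (0, 0).
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r[i, j] = cost[i - 1, j - 1] + softmin(
                [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]], gamma
            )
    return r[n, m]


if __name__ == "__main__":
    a = np.array([[0.0], [1.0], [2.0]])
    b = np.array([[0.0], [1.0], [1.0], [2.0]])
    for gamma in (0.01, 0.1, 1.0):
        print(gamma, soft_dtw(a, b, gamma=gamma))
```

For small `gamma` the result approaches the classical DTW cost, while larger values spread the alignment over many paths. A temperature of this kind is a natural candidate for the hyperparameter scheduling mentioned above, although the abstract itself does not specify which hyperparameter is scheduled.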