Segmenting a dance video into short movements is a popular way to make dance choreography easier to understand. However, this segmentation is currently done manually and requires significant effort by experts. As a result, even though many dance videos are available on social media (e.g., TikTok and YouTube), it remains difficult for people, especially novices, to casually watch short video segments when practicing dance choreography. In this paper, we propose a method to automatically segment a dance video into individual movements. Given a dance video as input, we first extract visual and audio features: the former is computed from the keypoints of the dancer in the video, and the latter from the Mel spectrogram of the accompanying music. These features are then passed to a Temporal Convolutional Network (TCN), and segmentation points are estimated by picking peaks in the network's output. To build our training dataset, we annotated segmentation points in dance videos from the AIST Dance Video Database, a shared database of original street dance videos with copyright-cleared dance music. Our evaluation study shows that the proposed method (i.e., combining the visual and audio features) estimates segmentation points with high accuracy. In addition, we developed an application that uses the proposed method to help dancers practice choreography.
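The final step of the pipeline, picking peaks from the network's per-frame output to obtain segmentation points, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the use of `scipy.signal.find_peaks`, and the threshold and minimum-gap parameters are all assumptions for the sake of the example.

```python
import numpy as np
from scipy.signal import find_peaks

def estimate_segmentation_points(frame_scores, fps=30.0,
                                 min_gap_s=0.5, threshold=0.5):
    """Pick peaks of a per-frame segmentation score (e.g., a TCN output)
    and convert the peak frame indices to timestamps in seconds.

    frame_scores: 1D array of per-frame scores in [0, 1] (hypothetical).
    min_gap_s:    minimum time allowed between two segmentation points.
    threshold:    minimum score for a frame to count as a peak.
    """
    min_gap_frames = max(1, int(min_gap_s * fps))
    peaks, _ = find_peaks(frame_scores,
                          height=threshold,
                          distance=min_gap_frames)
    return peaks / fps

# Toy example: a synthetic score curve with two clear peaks at 30 fps.
scores = np.zeros(120)
scores[30] = 0.9   # candidate segmentation point at 1.0 s
scores[90] = 0.8   # candidate segmentation point at 3.0 s
print(estimate_segmentation_points(scores))
```

Enforcing a minimum gap between peaks (the `distance` argument) prevents a single movement boundary from producing several closely spaced segmentation points.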