An interesting problem in many video-based applications is the generation of short synopses by selecting the most informative frames, a procedure known as video summarization. For sign language videos, the benefits of using the $t$-parameterized counterpart of the curvature of the signer's 2-D wrist trajectory to identify keyframes have recently been reported in the literature. In this paper we extend these ideas by modeling the 3-D hand motion extracted from each frame of the video. To this end, we propose a new informative function based on the $t$-parameterized curvature and torsion of the 3-D trajectory. The criterion for labeling a video frame as a keyframe depends on whether the motion occurs in 2-D or 3-D space. Specifically, for 3-D motion we look for the maxima of the harmonic mean of the curvature and torsion of the target's trajectory; for planar motion we seek the maxima of the trajectory's curvature. The proposed 3-D feature is experimentally evaluated on sign language videos using (1) objective measures based on ground-truth keyframe annotations, (2) human-based evaluation of understanding, and (3) gloss classification; the results obtained are promising.
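The keyframe criterion described above can be illustrated with a minimal sketch. The abstract does not give the exact $t$-parameterized definitions, so the code assumes the standard differential-geometry formulas $\kappa = \lVert r' \times r'' \rVert / \lVert r' \rVert^3$ and $\tau = (r' \times r'') \cdot r''' / \lVert r' \times r'' \rVert^2$, estimated by finite differences, and it takes the harmonic mean of $\kappa$ and $|\tau|$. The function names, the use of the absolute torsion, the planarity tolerance, and the simple local-maximum rule are all hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np


def curvature_torsion(points, dt=1.0, eps=1e-12):
    """Estimate curvature and torsion of a sampled trajectory.

    points: array of shape (N, 2) or (N, 3), one row per video frame.
    Assumption: standard formulas with finite-difference derivatives;
    the paper's t-parameterized variants may differ.
    """
    r = np.asarray(points, dtype=float)
    if r.shape[1] == 2:
        # Embed planar motion in 3-D so the same formulas apply.
        r = np.hstack([r, np.zeros((r.shape[0], 1))])
    d1 = np.gradient(r, dt, axis=0)   # r'
    d2 = np.gradient(d1, dt, axis=0)  # r''
    d3 = np.gradient(d2, dt, axis=0)  # r'''
    cross = np.cross(d1, d2)
    cross_norm = np.linalg.norm(cross, axis=1)
    speed = np.linalg.norm(d1, axis=1)
    kappa = cross_norm / np.maximum(speed ** 3, eps)
    tau = np.einsum("ij,ij->i", cross, d3) / np.maximum(cross_norm ** 2, eps)
    return kappa, tau


def informative_function(points, dt=1.0, planar_tol=1e-6):
    """Curvature for planar motion, else harmonic mean of curvature and |torsion|.

    Assumption: motion counts as planar when |torsion| stays below planar_tol.
    """
    kappa, tau = curvature_torsion(points, dt)
    abs_tau = np.abs(tau)
    if np.all(abs_tau < planar_tol):
        return kappa
    return 2.0 * kappa * abs_tau / np.maximum(kappa + abs_tau, 1e-12)


def local_maxima(signal):
    """Indices of interior samples that exceed the previous sample
    and are not smaller than the next one (candidate keyframes)."""
    s = np.asarray(signal, dtype=float)
    mask = (s[1:-1] > s[:-2]) & (s[1:-1] >= s[2:])
    return np.flatnonzero(mask) + 1
```

Candidate keyframes for a tracked trajectory `traj` (one row per frame) would then be `local_maxima(informative_function(traj))`. In practice, trajectories extracted from video are noisy, and repeated finite differencing amplifies that noise, so some smoothing before differentiation is likely to be needed; that step is omitted here.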