Zero-shot action recognition, which recognizes actions in videos without having seen any training examples of those actions, is attracting wide attention because it saves labor costs and training time. However, the accuracy of zero-shot learning remains unsatisfactory, which limits its practical application. To address this problem, this study proposes a framework that improves zero-shot action recognition using human instructions given as text descriptions. The framework relies on manually written descriptions of video contents, which incurs some labor cost; in many situations, this cost is worthwhile. We manually annotate each action with text features, each of which can be a word, phrase, or sentence. We then predict the class of a video by computing the matching degree between the video and every text feature. The proposed model can also be combined with other models to improve accuracy, and it can be optimized continuously by repeating the human instructions. Experiments on UCF101 and HMDB51 showed that our model achieved the best accuracy and improved the accuracies of other models.
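The matching step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the video and every text feature have already been embedded into a shared vector space by some encoder (not specified here), uses cosine similarity as the matching degree, and scores a class by its best-matching description. The names `cosine` and `predict_class`, the toy vectors, and the max aggregation are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_class(video_vec, text_features):
    """Score each class by its best-matching text feature and return the top class.

    text_features maps a class label to a list of embedding vectors, one per
    manually written word, phrase, or sentence for that class.
    Taking the max over descriptions is an assumption of this sketch.
    """
    scores = {
        label: max(cosine(video_vec, vec) for vec in vecs)
        for label, vecs in text_features.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy 3-d embeddings with made-up values (not taken from the paper).
text_features = {
    "archery": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "biking": [[0.0, 1.0, 0.0]],
}
video_vec = [0.8, 0.2, 0.0]

label, scores = predict_class(video_vec, text_features)
print(label)
```

Under this sketch, a further round of human instruction would amount to appending another embedded description to a class's list in `text_features`; whether the actual framework works this way is not stated in the text.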