Batch Normalization's (BN) unique property of depending on other samples in a batch is known to cause problems in several tasks, including sequential modeling. Yet, BN-related issues are hardly studied for long video understanding, despite the ubiquitous use of BN in CNNs for feature extraction. Especially in surgical workflow analysis, where the lack of pretrained feature extractors has led to complex, multi-stage training pipelines, limited awareness of BN issues may have hidden the benefits of training CNNs and temporal models end to end. In this paper, we present and analyze both known and novel pitfalls of BN in video learning, including issues specific to online tasks such as a 'cheating' effect in anticipation. We observe that BN's properties create major obstacles for end-to-end learning. However, using BN-free backbones, even simple CNN-LSTMs beat the state of the art in two surgical tasks by utilizing adequate end-to-end training strategies which maximize temporal context. We conclude that awareness of BN's pitfalls is crucial for effective end-to-end learning in surgical tasks. By reproducing results on natural-video datasets, we hope our insights will benefit other areas of video learning as well. Code: \url{https://gitlab.com/nct_tso_public/pitfalls_bn}.
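The batch dependence mentioned above can be illustrated with a minimal sketch. It is not taken from the linked repository: the helper `batch_norm_train` is a hypothetical name, and it assumes scalar features and training-mode BN without learned scale and shift parameters.

```python
import math

def batch_norm_train(batch, eps=1e-5):
    """Normalize scalar features with the statistics of the current batch.

    Hypothetical helper for illustration: training-mode BN on scalars,
    without the learned affine parameters.
    """
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)  # biased variance
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

# The same sample is placed in two batches with different companions.
sample = 1.0
out_a = batch_norm_train([sample, 2.0, 3.0])[0]
out_b = batch_norm_train([sample, -2.0, -3.0])[0]

# The normalized value of `sample` changes with its batch mates.
print(out_a, out_b)
```

The input value is identical in both calls, yet its normalized output differs, and here it even changes sign, because the batch mean and variance are computed from the other samples as well.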