Batch Normalization (BN) has the unusual property that the output for one sample depends on the other samples in its batch, which is known to cause problems in several tasks, including sequence modeling. Yet BN-related issues have hardly been studied for long video understanding, despite the ubiquitous use of BN in Convolutional Neural Networks (CNNs) for feature extraction. In surgical workflow analysis in particular, the lack of pretrained feature extractors has led to complex, multi-stage training pipelines, and limited awareness of BN issues may have hidden the benefits of training CNNs and temporal models end to end. In this paper, we analyze pitfalls of BN in video learning, including issues specific to online tasks such as a 'cheating' effect in anticipation. We observe that BN's properties create major obstacles for end-to-end learning. With BN-free backbones, however, even simple CNN-LSTMs beat the state of the art on three surgical workflow benchmarks when trained with adequate end-to-end strategies that maximize temporal context. We conclude that awareness of BN's pitfalls is crucial for effective end-to-end learning in surgical tasks. Because we also reproduce our results on natural-video datasets, we hope our insights will benefit other areas of video learning as well. Code is available at: \url{https://gitlab.com/nct_tso_public/pitfalls_bn}
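The batch dependence mentioned above can be illustrated with a minimal NumPy sketch. It is not taken from the paper's code: the function `batch_norm_train`, the feature size, and the batch contents are all hypothetical, and the learnable scale and shift of a real BN layer are left out. It normalizes the same frame features inside two different batches and checks whether the result is identical.

```python
import numpy as np

def batch_norm_train(x, eps=1e-5):
    """Training-mode batch normalization without learnable scale/shift.

    x has shape (batch, features); mean and variance are computed
    over the batch axis, so every row influences every other row.
    """
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)

# Hypothetical features of one fixed frame (4 feature channels).
frame = rng.normal(size=(1, 4))

# The same frame placed in two batches with different companions.
batch_a = np.concatenate([frame, rng.normal(size=(7, 4))])
batch_b = np.concatenate([frame, rng.normal(loc=3.0, size=(7, 4))])

out_a = batch_norm_train(batch_a)[0]
out_b = batch_norm_train(batch_b)[0]

# Compare the normalized output of the identical input frame.
same_output = np.allclose(out_a, out_b)
print("identical output for identical input:", same_output)
```

If a batch is filled with frames from a single video, the batch statistics summarize that video, so each frame's normalized features carry some information about the other frames in the batch. This is only one way to picture why batch dependence matters for online video tasks; the sketch does not reproduce the experiments reported in the paper.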