Commonly employed in polyp segmentation, single-image UNet architectures lack the temporal insight clinicians gain from video data when diagnosing polyps. To mirror clinical practice more faithfully, our proposed solution, PolypNextLSTM, leverages video-based deep learning, harnessing temporal information for superior segmentation performance with the least parameter overhead, making it potentially suitable for edge devices. PolypNextLSTM employs a UNet-like structure with ConvNext-Tiny as its backbone, strategically omitting the last two layers to reduce parameter overhead. Our temporal fusion module, a Convolutional Long Short Term Memory (ConvLSTM), effectively exploits temporal features. Our primary novelty lies in PolypNextLSTM, which stands out as the leanest and fastest model, surpassing the performance of five state-of-the-art image-based and five video-based deep learning models. The evaluation on the SUN-SEG dataset spans easy-to-detect and hard-to-detect polyp scenarios, along with videos containing challenging artefacts like fast motion and occlusion. Comparison against the five image-based and five video-based models demonstrates PolypNextLSTM's superiority, achieving a Dice score of 0.7898 on the hard-to-detect polyp test set, surpassing image-based PraNet (0.7519) and video-based PNSPlusNet (0.7486). Notably, our model excels in videos featuring complex artefacts such as ghosting and occlusion. PolypNextLSTM, integrating a pruned ConvNext-Tiny with ConvLSTM for temporal fusion, not only exhibits superior segmentation performance but also maintains the highest frames per second among the evaluated models. The code is available at https://github.com/mtec-tuhh/PolypNextLSTM
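The temporal fusion idea described above can be sketched in PyTorch. This is a minimal, hypothetical illustration of a ConvLSTM cell applied to per-frame feature maps, not the authors' exact implementation; the class and function names (`ConvLSTMCell`, `fuse_temporal`) and the channel/spatial sizes are assumptions for demonstration only.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: the four LSTM gates are computed with a single
    2D convolution, so the hidden state keeps its spatial layout (sketch only,
    not the paper's exact module)."""
    def __init__(self, in_ch: int, hid_ch: int, kernel: int = 3):
        super().__init__()
        # One conv produces all four gates (input, forget, cell, output).
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel,
                               padding=kernel // 2)
        self.hid_ch = hid_ch

    def forward(self, x, state):
        h, c = state
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)   # update cell state
        h = o * torch.tanh(c)           # update hidden state
        return h, c

def fuse_temporal(features):
    """Run a ConvLSTM over a list of per-frame feature maps [B, C, H, W]
    (e.g. bottleneck features from a truncated ConvNext-Tiny encoder) and
    return the temporally fused feature map for the last frame."""
    B, C, H, W = features[0].shape
    cell = ConvLSTMCell(C, C)
    h = torch.zeros(B, C, H, W)
    c = torch.zeros(B, C, H, W)
    for x in features:                  # iterate over the temporal dimension
        h, c = cell(x, (h, c))
    return h

# Toy usage: a 5-frame clip of 64-channel feature maps.
frames = [torch.randn(1, 64, 16, 16) for _ in range(5)]
fused = fuse_temporal(frames)
print(fused.shape)  # torch.Size([1, 64, 16, 16])
```

The fused map retains the spatial resolution of the input features, so it can be passed directly to a UNet-style decoder.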