PolypNextLSTM: A lightweight and fast polyp video segmentation network using ConvNext and ConvLSTM

Single-image UNet architectures, commonly employed in polyp segmentation, lack the temporal insight clinicians gain from video data when diagnosing polyps. To mirror clinical practice more faithfully, our proposed solution, PolypNextLSTM, leverages video-based deep learning, harnessing temporal information for superior segmentation performance with the least parameter overhead, making it potentially suitable for edge devices. PolypNextLSTM employs a UNet-like structure with ConvNext-Tiny as its backbone, omitting the last two layers to reduce parameter overhead. Our temporal fusion module, a Convolutional Long Short-Term Memory (ConvLSTM), exploits temporal features. Our primary novelty is that PolypNextLSTM is the leanest in parameters and the fastest of the evaluated models, while surpassing the performance of five state-of-the-art image-based and five video-based deep learning models. Evaluation on the SUN-SEG dataset spans easy-to-detect and hard-to-detect polyp scenarios, along with videos containing challenging artefacts such as fast motion and occlusion. In a comparison against five image-based and five video-based models, PolypNextLSTM achieves a Dice score of 0.7898 on the hard-to-detect polyp test set, surpassing image-based PraNet (0.7519) and video-based PNSPlusNet (0.7486). Notably, our model excels in videos featuring complex artefacts such as ghosting and occlusion. PolypNextLSTM, integrating a pruned ConvNext-Tiny with ConvLSTM for temporal fusion, not only exhibits superior segmentation performance but also maintains the highest frames per second among the evaluated models. Code is available at https://github.com/mtec-tuhh/PolypNextLSTM
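
For readers unfamiliar with the Dice score reported above, the following is a minimal sketch of the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), for two binary masks. It is an illustration only and not the evaluation code from the repository: the function name `dice_score`, the nested-list mask format, and the convention that two empty masks score 1.0 are assumptions made here, and the abstract does not specify how scores are thresholded or averaged across frames and videos.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks (nested lists of 0/1).

    Hypothetical helper for illustration; not taken from the PolypNextLSTM code.
    """
    # Flatten both masks and coerce every pixel to 0 or 1.
    pred_flat = [int(bool(v)) for row in pred for v in row]
    target_flat = [int(bool(v)) for row in target for v in row]
    if len(pred_flat) != len(target_flat):
        raise ValueError("masks must have the same number of pixels")

    # |A ∩ B|: pixels marked as polyp in both masks.
    intersection = sum(p * t for p, t in zip(pred_flat, target_flat))
    # |A| + |B|: total polyp pixels in prediction and ground truth.
    total = sum(pred_flat) + sum(target_flat)

    # Assumed convention: two empty masks agree perfectly.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy 2x3 masks: 2 overlapping pixels, 3 predicted, 3 in the ground truth.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = dice_score(pred, target)  # 2 * 2 / (3 + 3)
```

A score of 1.0 means the predicted and annotated polyp regions coincide exactly, and 0.0 means they do not overlap at all.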
