Robust visual recognition in underwater environments remains a significant challenge due to complex distortions such as turbidity, low illumination, and occlusion, which severely degrade the performance of standard vision systems. This paper introduces AQUA20, a comprehensive benchmark dataset comprising 8,171 underwater images spanning 20 marine species and reflecting real-world environmental challenges such as poor illumination, turbidity, and occlusion, providing a valuable resource for underwater visual understanding. Thirteen state-of-the-art deep learning models, including lightweight CNNs (SqueezeNet, MobileNetV2) and transformer-based architectures (ViT, ConvNeXt), were evaluated to benchmark their performance in classifying marine species under these challenging conditions. Our experimental results show that ConvNeXt achieves the best performance, with a Top-3 accuracy of 98.82%, a Top-1 accuracy of 90.69%, and the highest overall F1-score of 88.92%, at a moderately large parameter count. The results from the remaining benchmark models further illustrate the trade-offs between model complexity and performance. We also provide an extensive explainability analysis using Grad-CAM and LIME to interpret the strengths and pitfalls of the models. Our results reveal substantial room for improvement in underwater species recognition and demonstrate the value of AQUA20 as a foundation for future research in this domain. The dataset is publicly available at: https://huggingface.co/datasets/taufiktrf/AQUA20.