This paper considers the automatic classification of fish herding behavior in the cluttered, low-visibility environment that typically surrounds towed fishing gear. It compares three convolutional and attention-based deep action recognition architectures, each trained end-to-end on a small set of video sequences captured by a remotely controlled camera and labeled by an expert in fishing technology. The sequences depict the scene in front of a fishing trawl in which the conventional herding mechanism has been replaced by directed laser light. The goal is to detect whether a fish is present in a sequence and, if so, to classify whether or not it reacts to the lasers. A two-stream CNN model, a CNN-transformer hybrid, and a pure transformer model achieved 63%, 54%, and 60% accuracy, respectively, under 10-fold cross-validation on this three-class task, measured against the human expert's labels. Inspection of the activation maps learned by the three networks raises questions about which attributes of the sequences the models are actually learning, specifically whether changes in viewpoint introduced by human camera operators, which shift the position of the laser lines in the video frames, may interfere with the classification. This underlines the importance of careful experimental design when capturing scientific data for automatic end-to-end evaluation, and the usefulness of inspecting the trained models.