Pig aggression classification using CNN, Transformers and Recurrent Networks

The development of techniques to analyze and detect animal behavior is crucial for the livestock sector, because such techniques make it possible to monitor stress and animal welfare and support decision making on the farm. Applications built on them can help breeders improve production performance and reduce costs, since animal behavior is currently analyzed by humans, which is time-consuming and error-prone. Aggressiveness in pigs is one behavior studied with the aim of reducing its impact through animal classification and identification. This laborious, error-prone process can be automated by visually classifying videos captured in a controlled environment. The captured videos can be used to train neural networks, which then classify behavior using computer vision and artificial intelligence. The main techniques used in this study are transformer variants (STAM, TimeSformer, and ViViT) and convolution-based techniques (ResNet3D2, ResNet(2+1)D, and CnnLstm). These techniques were applied to pig video classification with the objective of distinguishing aggressive from non-aggressive behavior. The techniques were compared to analyze the contribution of transformers and the effectiveness of convolution in video classification. Performance was evaluated using accuracy, precision, and recall. TimeSformer showed the best results in video classification, with a median accuracy of 0.729.
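As an illustration of the evaluation metrics named above, the following minimal sketch computes accuracy, precision, and recall for a binary aggressive/non-aggressive task and takes the median accuracy over several runs. It is not the study's evaluation code: the function names `binary_metrics` and `median_accuracy`, the label encoding (1 = aggressive, 0 = non-aggressive), and the example labels are all assumptions made for illustration.

```python
from statistics import median


def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for one set of video-level labels.

    Assumed encoding: `positive` (1) marks an aggressive clip, anything else
    a non-aggressive clip.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero (no predicted or no true positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


def median_accuracy(runs):
    """Median accuracy over an iterable of (y_true, y_pred) pairs, one per run."""
    return median(binary_metrics(t, p)[0] for t, p in runs)


if __name__ == "__main__":
    # Hypothetical labels for three runs; not data from the study.
    runs = [
        ([1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0]),
        ([1, 1, 0, 0], [1, 0, 0, 0]),
        ([1, 0, 1, 0], [0, 1, 1, 0]),
    ]
    for y_true, y_pred in runs:
        print(binary_metrics(y_true, y_pred))
    print("median accuracy:", median_accuracy(runs))
```

Reporting the median rather than the mean across runs makes the summary less sensitive to a single unusually good or bad run, which is one plausible reason to prefer it when the number of runs is small.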
