Towards Vision Mixture of Experts for Wildlife Monitoring on the Edge

The explosion of IoT sensors in industrial, consumer and remote sensing use cases has come with unprecedented demand for computing infrastructure to transmit and analyze petabytes of data. Concurrently, the world is slowly shifting its focus towards more sustainable computing. For these reasons, there has been a recent effort to reduce the footprint of the related computing infrastructure, especially that of deep learning algorithms used for advanced insight generation. The `TinyML' community is actively proposing methods to save communication bandwidth and avoid excessive cloud storage costs while reducing algorithm inference latency and promoting data privacy. Such approaches should ideally process multiple types of data, including time series, audio, satellite images, and video, near the network edge, because multiple data streams have been shown to improve the discriminative ability of learning algorithms, especially for generating fine-grained results. Incidentally, recent work on data-driven conditional computation of subnetworks has shown real progress in using a single model to share parameters among very different types of inputs such as images and text, reducing the computation requirement of multi-tower multimodal networks. Inspired by this line of work, we explore, for the first time, similar per-patch conditional computation for mobile vision transformers (the vision-only case), which will eventually be used for single-tower multimodal edge models. We evaluate the model on Cornell Sap Sucker Woods 60 (SSW60), a fine-grained bird species discrimination dataset. In our initial experiments, the model uses $4\times$ fewer parameters than MobileViTV2-1.0 with a $1$% accuracy drop on the iNaturalist '21 birds test data provided as part of the SSW60 dataset.
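To make the idea of per-patch conditional computation concrete, the following is a minimal sketch of a mixture-of-experts layer in which each image patch is routed to a single expert. The abstract does not specify the routing scheme, so top-1 softmax routing, the linear experts, and all names here (`PatchMoE`, `make_matrix`, the dimensions) are assumptions made purely for illustration, not the architecture evaluated in this work.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def make_matrix(rows, cols, rng):
    """Small random weight matrix; stands in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class PatchMoE:
    """Hypothetical top-1 mixture-of-experts layer applied per patch."""

    def __init__(self, dim, num_experts, seed=0):
        rng = random.Random(seed)
        # Router: one row of logit weights per expert (assumed linear router).
        self.router = make_matrix(num_experts, dim, rng)
        # Experts: one dim x dim linear map each (assumed for simplicity).
        self.experts = [make_matrix(dim, dim, rng) for _ in range(num_experts)]

    def forward(self, patches):
        outputs, choices = [], []
        for patch in patches:
            gates = softmax(matvec(self.router, patch))
            # Top-1 routing: only the highest-scoring expert runs for this patch.
            k = max(range(len(gates)), key=gates.__getitem__)
            expert_out = matvec(self.experts[k], patch)
            # Scale by the gate value, as is common in learned routing.
            outputs.append([gates[k] * v for v in expert_out])
            choices.append(k)
        return outputs, choices


# Usage: 16 patch embeddings of dimension 8, routed among 4 experts.
rng = random.Random(1)
patches = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]
layer = PatchMoE(dim=8, num_experts=4)
outputs, choices = layer.forward(patches)
print(choices)  # index of the expert selected for each patch
```

In this sketch the layer stores parameters for every expert, but each patch pays the compute cost of only one of them, which is the sense in which the computation is conditional on the input.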
