Video Internet of Things (VIoT) has shown great potential for collecting an unprecedented volume of video data. Learning to schedule perception models and to analyze the collected videos intelligently could be a key driver for VIoT. In this paper, to address the challenges posed by the fine-grained and interrelated use of vision tools in VIoT, we build VIoTGPT, an LLM-based framework that correctly interacts with humans, queries knowledge videos, and invokes vision models to accomplish complicated tasks. To support VIoTGPT and related future work, we meticulously craft a training dataset and establish benchmarks involving 11 representative vision models across three categories, based on semi-automatic annotations. To guide the LLM to act as an intelligent agent for VIoT, we apply ReAct instruction tuning on the collected VIoT dataset so that the model learns tool-use capability. Quantitative and qualitative experimental results and analyses demonstrate the effectiveness of VIoTGPT.