Crop monitoring is essential for precision agriculture, but current systems lack high-level reasoning. We introduce a novel, modular framework that uses a Visual Language Model (VLM) to guide robotic task planning by interleaving input queries with action primitives. We contribute a comprehensive benchmark of short- and long-horizon crop monitoring tasks in monoculture and polyculture environments. Our main results show that VLMs perform robustly on short-horizon tasks, with success rates comparable to humans, but exhibit significant performance degradation on challenging long-horizon tasks. Critically, the system fails when relying on noisy semantic maps, demonstrating a key limitation in current VLM context grounding for sustained robotic operations. This work offers a deployable framework and critical insights into VLM capabilities and shortcomings for complex agricultural robotics.
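The planning scheme sketched above, a VLM proposing action primitives in a closed loop with observations, can be illustrated as follows. This is a minimal sketch under assumptions, not the paper's implementation: `query_vlm`, the primitive set, and the observation strings are all hypothetical stand-ins (the VLM call is replaced by a trivial rule-based stub so the loop runs end to end).

```python
# Hypothetical sketch of a VLM-guided monitoring loop: the "VLM" proposes
# one action primitive per step given the task and the latest observation.
PRIMITIVES = {"move_to", "inspect_plant", "report", "done"}

def query_vlm(observation: str, task: str) -> str:
    """Stand-in for a real VLM query; a rule-based stub for illustration."""
    if "unvisited" in observation:
        return "move_to"
    if "at_plant" in observation:
        return "inspect_plant"
    return "done"

def monitoring_loop(task: str, max_steps: int = 10) -> list:
    """Interleave VLM queries with (simulated) primitive execution."""
    observation = "unvisited plant bed"
    history = []
    for _ in range(max_steps):
        action = query_vlm(observation, task)
        assert action in PRIMITIVES  # guard against free-form VLM output
        history.append(action)
        if action == "done":
            break
        # Simulate executing the primitive and updating the observation.
        observation = "at_plant row 1" if action == "move_to" else "inspection logged"
    return history

print(monitoring_loop("check crop health"))
```

The `assert` illustrates one practical design choice such a framework needs: constraining free-form VLM output to a fixed primitive vocabulary before execution.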