Foundation vision-language models are becoming increasingly relevant to robotics because they provide richer semantic perception than narrow, task-specific pipelines. However, their practical adoption in robot software stacks depends on reproducible middleware integrations, not on model quality alone. Florence-2 is especially attractive in this regard because it unifies captioning, optical character recognition, open-vocabulary detection, grounding, and related vision-language tasks within a comparatively manageable model size. This article presents a ROS 2 wrapper for Florence-2 that exposes the model through three complementary interaction modes: continuous topic-driven processing, synchronous service calls, and asynchronous actions. The wrapper is designed for local execution and supports both native installation and Docker-based deployment. It also combines generic JSON outputs with standard ROS 2 message bindings for detection-oriented tasks. A functional validation is reported together with a throughput study on several GPUs, showing that local deployment is feasible on consumer-grade hardware. The repository is publicly available at https://github.com/JEDominguezVidal/florence2_ros2_wrapper
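To illustrate how a generic JSON output can be paired with a detection-oriented message binding, the following minimal sketch converts a Florence-2 style object-detection result into centre/size boxes. It assumes the JSON follows the layout that Florence-2 post-processing produces for the `<OD>` task (a `bboxes` list of `[x1, y1, x2, y2]` corners and a parallel `labels` list); the wrapper's actual schema may differ. The `Detection` dataclass and `parse_detections` function are hypothetical names that only mirror the centre/size convention of the `vision_msgs` bounding-box messages, and the sketch deliberately avoids any ROS 2 dependency.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """Hypothetical stand-in for a detection message (centre/size box plus label)."""
    label: str
    center_x: float
    center_y: float
    size_x: float
    size_y: float


def parse_detections(result_json: str, task: str = "<OD>") -> List[Detection]:
    """Convert corner-format boxes from a JSON string into centre/size detections.

    Assumes the payload looks like:
    {"<OD>": {"bboxes": [[x1, y1, x2, y2], ...], "labels": ["...", ...]}}
    """
    payload = json.loads(result_json).get(task, {})
    bboxes = payload.get("bboxes", [])
    labels = payload.get("labels", [])

    detections = []
    for (x1, y1, x2, y2), label in zip(bboxes, labels):
        detections.append(
            Detection(
                label=label,
                center_x=(x1 + x2) / 2.0,  # box centre along x, in pixels
                center_y=(y1 + y2) / 2.0,  # box centre along y, in pixels
                size_x=x2 - x1,            # box width, in pixels
                size_y=y2 - y1,            # box height, in pixels
            )
        )
    return detections


if __name__ == "__main__":
    example = json.dumps(
        {"<OD>": {"bboxes": [[10.0, 20.0, 110.0, 220.0]], "labels": ["cup"]}}
    )
    for det in parse_detections(example):
        print(det)
```

In a ROS 2 node, the fields of each `Detection` would be copied into the corresponding detection message before publishing, while the raw JSON string remains available for tasks that have no standard message type.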