Fine-tuning multimodal large language models (MLLMs) with reinforcement learning has made remarkable progress, particularly with the introduction of various entropy control techniques. However, the role and characteristics of entropy in perception-oriented tasks such as visual grounding, and effective strategies for controlling it, remain largely unexplored. To address this gap, we focus on visual grounding and analyze the role and characteristics of entropy in this task in comparison with reasoning tasks. Building on these findings, we introduce ECVGPO (Entropy Control Visual Grounding Policy Optimization), an interpretable algorithm designed for effective entropy regulation. By controlling entropy, ECVGPO better balances the trade-off between exploration and exploitation. Experiments show that ECVGPO achieves broad improvements across various benchmarks and models.