While humans effortlessly draw visual objects and shapes by adaptively allocating attention based on their complexity, existing multimodal large language models (MLLMs) remain constrained by rigid token representations. To bridge this gap, we propose ALTo, an adaptive-length tokenizer for autoregressive mask generation. To achieve this, we design a novel token length predictor, along with a length regularization term and a differentiable token chunking strategy. We further build ALToLLM, which seamlessly integrates ALTo into an MLLM. Preferences over the trade-off between mask quality and efficiency are implemented via group relative policy optimization (GRPO). Experiments demonstrate that ALToLLM achieves state-of-the-art performance with adaptive token cost on popular segmentation benchmarks. Code and models are released at https://github.com/yayafengzi/ALToLLM.
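To make the adaptive-length idea concrete, the sketch below illustrates one way a differentiable token chunking step could be realized: a predicted length induces a soft keep-mask over candidate mask tokens, and the mean predicted length can serve as a length regularizer. This is a minimal assumption-based sketch, not the released ALTo implementation; the names `SoftChunker`, `max_tokens`, and `temperature` are hypothetical.

```python
# Hypothetical sketch of a differentiable token chunking step; not the authors' code.
import torch
import torch.nn as nn


class SoftChunker(nn.Module):
    def __init__(self, hidden_dim: int, max_tokens: int = 32, temperature: float = 0.1):
        super().__init__()
        self.length_head = nn.Linear(hidden_dim, 1)  # token length predictor (scalar output)
        self.max_tokens = max_tokens
        self.temperature = temperature

    def forward(self, features: torch.Tensor, tokens: torch.Tensor):
        # features: (B, hidden_dim) pooled query/image features
        # tokens:   (B, max_tokens, D) candidate mask tokens
        # Predicted length in [0, max_tokens], kept differentiable via a sigmoid.
        length = torch.sigmoid(self.length_head(features)) * self.max_tokens  # (B, 1)
        positions = torch.arange(self.max_tokens, device=tokens.device).float()  # (T,)
        # Soft keep-mask: ~1 for positions below the predicted length, ~0 beyond it,
        # so gradients flow back into the length predictor.
        keep = torch.sigmoid((length - positions) / self.temperature)  # (B, T)
        chunked = tokens * keep.unsqueeze(-1)
        # Penalizing the mean predicted length acts as a length regularization term.
        length_reg = length.mean()
        return chunked, keep, length_reg
```

In such a setup, the mask quality loss would pull the predicted length upward while the regularizer pushes it down, yielding a token budget adapted to each object's complexity.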