Visual document understanding typically requires separate retrieval and generation models, doubling memory and system complexity. We present Hydra, a dual-head approach that provides both ColBERT-style late-interaction retrieval and autoregressive generation from a single vision-language model (VLM). A single LoRA adapter, trained only for retrieval, is toggled at inference: enabling it produces multi-vector embeddings, and disabling it recovers the base model's generation exactly. Against an independent base-model pipeline, outputs are byte-identical in 100% of 10,500 greedy and stochastic samples, with a maximum ANLS delta of 0.0044 across 15,301 samples on four VQA benchmarks (three are informative; both models score near zero on ChartQA under greedy decoding). We identify three engineering requirements (attention-mode restoration, lm_head preservation, and KV-cache-aware decoding) whose omission silently breaks generation even when the weights are recovered correctly. On ViDoRe V1, Hydra (4B) is within 1 percentage point of a controlled single-head baseline in a single training run; its higher aggregate scores on V2 and V3 are concentrated on a subset of tasks, and multi-seed experiments are needed to confirm these trends. The single-model design cuts peak GPU memory from 17.9 GB (two-model baseline) to 9.2 GB, a 48% reduction, though adapter switching introduces throughput overhead under concurrent serving loads. An ablation shows that GritLM-style joint training provides no benefit within the LoRA-based (r=16) training regime. A proof-of-concept extension to Qwen2.5-Omni-3B demonstrates that the mechanism generalizes to audio retrieval and video embedding alongside speech generation.
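To make the adapter toggle concrete, here is a minimal sketch assuming Hugging Face transformers and peft. The model id, adapter path, and helper names (embed, generate) are placeholders rather than the paper's implementation, and VLM-specific image handling is elided.

```python
# Minimal sketch of the dual-head toggle (placeholder names; not the
# paper's code). Assumes a PEFT-wrapped causal LM; the real system uses
# a 4B VLM with image inputs, which are elided here.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"  # stand-in for the 4B VLM
base = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = PeftModel.from_pretrained(base, "path/to/retrieval-lora")  # r=16 adapter
model.eval()

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    """Retrieval head: adapter ON (the default after loading) yields
    per-token multi-vector embeddings, ColBERT-style."""
    ids = tok(text, return_tensors="pt")
    hidden = model(**ids, output_hidden_states=True).hidden_states[-1][0]
    return F.normalize(hidden, dim=-1)  # (seq_len, hidden_dim)

@torch.no_grad()
def generate(prompt: str, **kw) -> str:
    """Generation head: adapter OFF. Byte-identical recovery also needs
    the run-time state restored: attention mode, lm_head, fresh KV cache."""
    with model.disable_adapter():  # peft context manager
        ids = tok(prompt, return_tensors="pt")
        out = model.generate(**ids, use_cache=True, **kw)
    return tok.decode(out[0], skip_special_tokens=True)
```

Note that disabling the adapter restores weights only; the three requirements named in the abstract concern run-time state (attention mode, output head, KV cache) that must be restored separately, which is why correct weight recovery alone can still silently break generation.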
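On the retrieval side, late-interaction scoring follows ColBERT's MaxSim: each query token vector is matched against its most similar document token vector and the maxima are summed. A self-contained sketch over L2-normalized token embeddings (random tensors stand in for the model's multi-vector outputs):

```python
# ColBERT-style MaxSim scoring between a query's token vectors Q and a
# document's token vectors D. Both are L2-normalized, so Q @ D.T gives
# cosine similarities. Shapes: Q (n_q, d), D (n_d, d).
import torch
import torch.nn.functional as F

def maxsim(Q: torch.Tensor, D: torch.Tensor) -> torch.Tensor:
    sim = Q @ D.T                        # (n_q, n_d) pairwise similarities
    return sim.max(dim=1).values.sum()   # best doc token per query token, summed

# Random stand-ins for real multi-vector embeddings (d = 128).
Q = F.normalize(torch.randn(8, 128), dim=-1)
D = F.normalize(torch.randn(200, 128), dim=-1)
print(float(maxsim(Q, D)))
```

Because each document is stored as a bag of token vectors rather than a single pooled vector, the same per-token hidden states produced with the adapter enabled serve directly as the retrieval index.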