We introduce Shakti VLM, a family of vision-language models at 1B and 4B parameters designed to address data efficiency challenges in multimodal learning. While recent VLMs achieve strong performance through extensive training data, Shakti models leverage architectural innovations to attain competitive results with fewer tokens. Key advancements include QK-Normalization for attention stability, hybrid normalization techniques, and enhanced positional encoding. A three-stage training strategy further optimizes learning efficiency. Evaluations show that Shakti-VLM-1B and Shakti-VLM-4B excel in document understanding, visual reasoning, OCR extraction, and general multimodal reasoning. Our results highlight that high performance can be achieved through model design and training strategy rather than sheer data volume, making Shakti an efficient solution for enterprise-scale multimodal tasks.
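As a minimal illustration of the QK-Normalization idea mentioned above, the sketch below normalizes query and key vectors before the attention dot-product so the logits stay bounded. The exact variant used in Shakti-VLM is not specified here; this sketch assumes a simple L2-norm formulation, and the function name `qk_norm_attention` and its parameters are illustrative only.

```python
import numpy as np

def qk_norm_attention(q, k, v, eps=1e-6):
    """Hedged sketch of QK-Normalization (assumed L2-norm variant):
    queries and keys are rescaled to unit length before the dot-product,
    which bounds the attention logits and helps stabilize training."""
    # Normalize each query/key vector along the head dimension.
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)

    d = q.shape[-1]
    # Standard scaled dot-product attention on the normalized projections.
    logits = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Usage: one head, 4 tokens, head dimension 8.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
print(qk_norm_attention(q, k, v).shape)  # (4, 8)
```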