We propose a scaling law hypothesis for multimodal models processing text, audio, images, and video within a shared token and embedding space. Our framework predicts model performance based on modality-specific compression and tokenization efficiency, extending established scaling laws from text-based decoder models to mixed-modality systems. We explore whether leveraging more training data in multiple modalities can reduce the size of the multimodal model, enabling efficient deployment on resource-constrained devices.
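The hypothesis above can be illustrated with a minimal sketch of a Chinchilla-style loss law in which the data term is weighted by per-modality tokenization efficiency. All coefficients and the efficiency weights below are hypothetical placeholders for illustration, not values proposed in this work.

```python
# Hypothetical sketch: a Chinchilla-style scaling law whose data term
# uses "effective" tokens, i.e. raw token counts weighted by an assumed
# per-modality compression/tokenization efficiency. Every constant here
# (E, A, B, alpha, beta, and the efficiency weights) is illustrative.

MODALITY_EFFICIENCY = {  # assumed information-per-token weights
    "text": 1.0,
    "audio": 0.6,
    "image": 0.4,
    "video": 0.3,
}

def effective_tokens(token_counts):
    """Weight raw per-modality token counts by assumed efficiency."""
    return sum(MODALITY_EFFICIENCY[m] * n for m, n in token_counts.items())

def predicted_loss(n_params, token_counts,
                   E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Predicted loss: irreducible term plus model- and data-limited terms,
    with the data term driven by modality-weighted effective tokens."""
    d_eff = effective_tokens(token_counts)
    return E + A / n_params**alpha + B / d_eff**beta
```

Under this form, adding training data in any modality increases the effective token count and lowers the data-limited term, which is the mechanism by which more multimodal data could substitute for model size.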