Large language models (LLMs), such as ChatGPT/GPT-4, have proven to be powerful tools for improving the user experience as AI assistants. Subsequent work has proposed multi-modal large language models (MLLMs), which give LLMs the ability to perceive inputs from multiple modalities by constructing a joint semantic space (e.g., a visual-text space). Although LLMs and MLLMs have achieved significant success, their use in domain-specific applications that require domain-specific knowledge and expertise remains underexplored, especially in the \textbf{marine domain}. Unlike general-purpose MLLMs, a marine-specific MLLM is required to yield much more \textbf{sensitive}, \textbf{informative}, and \textbf{scientific} responses. In this work, we demonstrate that existing MLLMs, optimized on huge amounts of readily available general-purpose training data, show only limited ability to understand domain-specific intents and to generate informative and satisfactory responses. To address these issues, we propose \textbf{MarineGPT}, the first vision-language model specially designed for the marine domain, unlocking the secrets of the ocean to the public. We present our \textbf{Marine-5M} dataset, which contains more than 5 million marine image-text pairs, to inject domain-specific marine knowledge into our model and achieve better marine vision-language alignment. MarineGPT not only brings marine understanding to the general public but also offers a standard protocol for adapting a general-purpose assistant into a downstream domain-specific expert. We pave the way for a wide range of marine applications while providing valuable data and pre-trained models for future research in both the academic and industrial communities.