Creativity is a fundamental aspect of intelligence, involving the ability to generate novel and appropriate solutions across diverse contexts. While Large Language Models (LLMs) have been extensively evaluated for their creative capabilities, the assessment of Multimodal Large Language Models (MLLMs) in this domain remains largely unexplored. To address this gap, we introduce Creation-MMBench, a multimodal benchmark specifically designed to evaluate the creative capabilities of MLLMs in real-world, image-based tasks. The benchmark comprises 765 test cases spanning 51 fine-grained tasks. To ensure rigorous evaluation, we define instance-specific evaluation criteria for each test case, guiding the assessment of both general response quality and factual consistency with visual inputs. Experimental results reveal that current open-source MLLMs significantly underperform proprietary models on creative tasks. Furthermore, our analysis demonstrates that visual fine-tuning can negatively impact the base LLM's creative abilities. Creation-MMBench provides valuable insights for advancing MLLM creativity and establishes a foundation for future improvements in multimodal generative intelligence. Full data and evaluation code are released at https://github.com/open-compass/Creation-MMBench.