Recent advances in language-based generative models have paved the way for orchestrating multiple generators of different artefact types (text, image, audio, etc.) in a single system. Many open-source pre-trained models now combine text with other modalities, enabling shared vector embeddings to be compared across different generators. In this context, we propose a novel approach to multimodal creative tasks using Quality Diversity evolution. Our contribution is a variation of the MAP-Elites algorithm, MAP-Elites with Transverse Assessment (MEliTA), which is tailored to multimodal creative tasks and leverages deep learned models that assess coherence across modalities. MEliTA decouples the artefacts' modalities and promotes cross-pollination between elites. As a test bed for this algorithm, we generate text descriptions and cover images for a hypothetical video game and assign each artefact a unique modality-specific behavioural characteristic. Results indicate that MEliTA can improve text-to-image mappings within the solution space, compared to a baseline MAP-Elites algorithm that strictly treats each image-text pair as one solution. Our approach represents a significant step forward in multimodal bottom-up orchestration and lays the groundwork for more complex systems coordinating multimodal creative agents in the future.
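To make the idea of decoupled modalities and cross-pollination between elites more concrete, the following is a minimal toy sketch, not the MEliTA implementation itself. It assumes that each artefact is a single number in [0, 1], that an artefact's behavioural characteristic is simply the bin its value falls into, and that coherence is one minus the absolute difference between the text and image values (a stand-in for a learned cross-modal model). The names `bc_bin`, `coherence`, `mutate`, `try_insert`, `transverse_step`, and `run`, as well as the rule for choosing partner elites, are hypothetical and chosen only for illustration.

```python
import random

BINS = 5  # cells per behavioural dimension (hypothetical value)


def bc_bin(value):
    """Map a value in [0, 1] to a behavioural bin index."""
    return min(int(value * BINS), BINS - 1)


def coherence(text, image):
    """Toy stand-in for a learned cross-modal coherence score (higher is better)."""
    return 1.0 - abs(text - image)


def mutate(value, rng, sigma=0.1):
    """Gaussian perturbation clamped to [0, 1]; stands in for a generator call."""
    return min(1.0, max(0.0, value + rng.gauss(0.0, sigma)))


def try_insert(archive, text, image):
    """Place the pair in its cell if the cell is empty or the pair scores higher."""
    cell = (bc_bin(text), bc_bin(image))
    score = coherence(text, image)
    if cell not in archive or score > archive[cell][2]:
        archive[cell] = (text, image, score)
        return True
    return False


def transverse_step(archive, rng):
    """Mutate one modality of a random elite, then pair the new artefact with the
    other-modality artefact (from the parent or from elites that share the new
    artefact's bin) that gives the highest coherence. This pairing rule is an
    assumption made for the sketch."""
    text, image, _ = archive[rng.choice(sorted(archive))]
    if rng.random() < 0.5:
        new_text = mutate(text, rng)
        row = bc_bin(new_text)
        partners = {image} | {e[1] for c, e in archive.items() if c[0] == row}
        best = max(partners, key=lambda img: coherence(new_text, img))
        return try_insert(archive, new_text, best)
    new_image = mutate(image, rng)
    col = bc_bin(new_image)
    partners = {text} | {e[0] for c, e in archive.items() if c[1] == col}
    best = max(partners, key=lambda txt: coherence(txt, new_image))
    return try_insert(archive, best, new_image)


def run(iterations=200, seed=0):
    """Seed the archive with random pairs, then apply transverse steps."""
    rng = random.Random(seed)
    archive = {}
    for _ in range(10):
        try_insert(archive, rng.random(), rng.random())
    for _ in range(iterations):
        transverse_step(archive, rng)
    return archive
```

A baseline in the spirit of the one described above would instead mutate the text and image of an elite together and insert only that fixed pair, without considering artefacts held by other elites.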