Choosing a suitable deep learning architecture for multimodal data fusion is challenging because it requires integrating and processing diverse data types, each with its own structure and characteristics. In this paper, we introduce MixMAS, a novel framework for sampling-based mixer architecture search tailored to multimodal learning. Our approach automatically selects the optimal MLP-based architecture for a given multimodal machine learning (MML) task. Specifically, MixMAS uses a sampling-based micro-benchmarking strategy to explore combinations of modality-specific encoders, fusion functions, and fusion networks, systematically identifying the combination that scores best on the task's performance metrics.
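The search described above can be illustrated with a minimal sketch. The abstract does not spell out the procedure, so everything here is an assumption made for illustration: the function `micro_benchmark_search`, the component names, the exhaustive loop over all combinations, and the `toy_score` function, which stands in for briefly training a candidate on the sampled subset and reporting a validation metric. It is not the MixMAS implementation.

```python
import itertools
import random
from typing import Callable, Dict, Sequence, Tuple

# A candidate architecture: (encoder, fusion function, fusion network).
Config = Tuple[str, str, str]


def micro_benchmark_search(
    encoders: Sequence[str],
    fusion_functions: Sequence[str],
    fusion_networks: Sequence[str],
    dataset: Sequence,
    score_fn: Callable[[Config, Sequence], float],
    sample_fraction: float = 0.1,
    seed: int = 0,
) -> Tuple[Config, Dict[Config, float]]:
    """Score every component combination on a small random subset of the data.

    Assumption: candidates are enumerated exhaustively and compared on one
    shared subset; the paper's actual procedure may differ.
    """
    rng = random.Random(seed)
    sample_size = max(1, int(len(dataset) * sample_fraction))
    # Draw the subset once so every candidate is benchmarked on the same data.
    subset = rng.sample(list(dataset), sample_size)

    results: Dict[Config, float] = {}
    for config in itertools.product(encoders, fusion_functions, fusion_networks):
        results[config] = score_fn(config, subset)

    # Higher score is assumed to be better (e.g. validation accuracy).
    best = max(results, key=results.get)
    return best, results


def toy_score(config: Config, subset: Sequence) -> float:
    """Hypothetical stand-in for 'train briefly on the subset, return a metric'."""
    contribution = {
        "mlp_mixer": 0.5, "patch_mlp": 0.3,    # hypothetical encoders
        "concat": 0.2, "sum": 0.1,             # hypothetical fusion functions
        "mixer_block": 0.2, "linear": 0.0,     # hypothetical fusion networks
    }
    return sum(contribution[name] for name in config)


best, results = micro_benchmark_search(
    encoders=["mlp_mixer", "patch_mlp"],
    fusion_functions=["concat", "sum"],
    fusion_networks=["mixer_block", "linear"],
    dataset=range(100),
    score_fn=toy_score,
)
print(best, len(results))
```

In a real setting, `score_fn` would be the expensive step, which is why scoring on a small sampled subset rather than the full dataset keeps the comparison of candidates affordable.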