We introduce MMIS, a novel dataset designed to advance MultiModal Interior Scene generation and recognition. MMIS consists of nearly 160,000 images, each paired with a textual description and an audio recording of that description, so every sample offers three aligned modalities for scene generation and recognition. The dataset covers a wide range of interior spaces and captures varied styles, layouts, and furnishings. To construct it, we followed a careful process of image collection, textual description generation, and creation of the corresponding speech annotations. MMIS supports research on multimodal representation learning tasks such as image generation, retrieval, captioning, and classification.
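To make the image, text, and audio pairing concrete, the following is a minimal sketch of how one sample could be represented in code. The class name `InteriorSample`, the helper `build_samples`, the `images/` and `audio/` folder names, and the `.jpg` and `.wav` extensions are all assumptions made for illustration; the abstract does not specify the released file layout.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class InteriorSample:
    """One interior image with its description and the spoken description."""
    image_path: Path
    description: str
    audio_path: Path


def build_samples(descriptions: dict[str, str], root: Path) -> list[InteriorSample]:
    """Pair each description with image and audio paths sharing the same stem.

    The directory names and file extensions below are hypothetical.
    """
    samples = []
    for stem, text in sorted(descriptions.items()):
        samples.append(
            InteriorSample(
                image_path=root / "images" / f"{stem}.jpg",
                description=text,
                audio_path=root / "audio" / f"{stem}.wav",
            )
        )
    return samples
```

Keying all three modalities on a shared file stem keeps the alignment explicit, which is convenient for retrieval or captioning experiments where a sample must be looked up from any one modality.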