EffiMiniVLM: A Compact Dual-Encoder Regression Framework

Predicting product quality from multimodal item information is critical in cold-start scenarios, where user interaction history is unavailable and predictions must rely on images and textual metadata. However, existing vision-language models typically depend on large architectures, extensive external datasets, or both, resulting in high computational cost. To address this, we propose EffiMiniVLM, a compact dual-encoder vision-language regression framework that combines an EfficientNet-B0 image encoder and a MiniLM-based text encoder with a lightweight regression head. To improve training sample efficiency, we introduce a weighted Huber loss that uses rating counts to emphasize more reliable samples, yielding consistent performance gains. Trained on only 20% of the Amazon Reviews 2023 dataset, the proposed model contains 27.7M parameters and requires 6.8 GFLOPs, yet achieves a CES score of 0.40 with the lowest resource cost in the benchmark. Despite its small size, it remains competitive with significantly larger models: it achieves comparable performance while being approximately 4x to 8x more resource-efficient than the other top-5 methods, and it is the only approach that does not use external datasets. Further analysis shows that increasing the training data to 40%, with no other change, allows our model to overtake the other methods, which use larger models and datasets, highlighting strong scalability despite the model's compact design.
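As a rough illustration of the weighted Huber loss described in the abstract, the sketch below scales each sample's Huber penalty by a weight derived from its rating count. The abstract does not state the weighting function, so the `log(1 + count)` weight, the `delta` threshold, and the function names `huber` and `weighted_huber_loss` are all assumptions made for this example, not the authors' formulation.

```python
import math


def huber(residual, delta=1.0):
    """Standard Huber penalty: quadratic for |r| <= delta, linear beyond it."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def weighted_huber_loss(preds, targets, rating_counts, delta=1.0):
    """Weighted mean of per-sample Huber penalties.

    ASSUMPTION: each weight is log(1 + rating_count), so items with more
    ratings count for more. The actual weighting scheme used by
    EffiMiniVLM is not given in the abstract.
    """
    weights = [math.log1p(c) for c in rating_counts]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one sample needs a positive rating count")
    weighted_sum = sum(
        w * huber(p - t, delta) for w, p, t in zip(weights, preds, targets)
    )
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Two items with the same error; the one with 500 ratings dominates.
    loss = weighted_huber_loss(
        preds=[4.0, 4.0], targets=[4.5, 2.0], rating_counts=[500, 3]
    )
    print(loss)
```

Under this assumed scheme, an item with zero ratings gets zero weight and so contributes nothing to the loss; a variant that adds a constant offset to every weight would keep such items in play.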

