Predicting product quality from multimodal item information is critical in cold-start scenarios, where user interaction history is unavailable and predictions must rely on images and textual metadata. However, existing vision-language models typically depend on large architectures, extensive external datasets, or both, which makes them computationally expensive. To address this, we propose EffiMiniVLM, a compact dual-encoder vision-language regression framework that combines an EfficientNet-B0 image encoder and a MiniLM-based text encoder with a lightweight regression head. To improve training sample efficiency, we introduce a weighted Huber loss that uses rating counts to emphasize more reliable samples, yielding consistent performance gains. Trained on only 20% of the Amazon Reviews 2023 dataset, the proposed model contains 27.7M parameters and requires 6.8 GFLOPs, yet achieves a CES score of 0.40 at the lowest resource cost in the benchmark. Despite its small size, it remains competitive with significantly larger models: it achieves comparable performance while being approximately 4x to 8x more resource-efficient than the other top-5 methods, and it is the only approach that does not use external datasets. Further analysis shows that scaling the training data to 40% is by itself enough for our model to overtake the other methods, which use larger models and datasets, highlighting strong scalability despite the model's compact design.
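To make the weighted Huber loss concrete, the following is a minimal sketch in plain Python. The abstract states only that rating counts are used to emphasize more reliable samples; the specific weighting function (`log1p` of the rating count), the normalization by the sum of weights, the default `delta`, and all function names are assumptions made for illustration and may differ from the scheme actually used in EffiMiniVLM.

```python
import math


def huber(residual, delta=1.0):
    """Standard Huber loss for a single residual (prediction - target)."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a            # quadratic region for small errors
    return delta * (a - 0.5 * delta)  # linear region for large errors


def rating_weight(rating_count):
    """Hypothetical weighting (an assumption, not taken from the paper):
    the weight grows logarithmically with the number of ratings, so items
    with many ratings count more without dominating the batch."""
    return math.log1p(rating_count)


def weighted_huber_loss(predictions, targets, rating_counts, delta=1.0):
    """Weighted mean of per-sample Huber losses, normalized by total weight."""
    weights = [rating_weight(c) for c in rating_counts]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("all sample weights are zero")
    weighted_sum = sum(
        w * huber(p - t, delta)
        for w, p, t in zip(weights, predictions, targets)
    )
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Two items with the same prediction; the second has far more ratings,
    # so its larger error contributes most of the loss.
    preds = [3.0, 3.0]
    targets = [4.0, 5.0]
    counts = [2, 500]
    print(weighted_huber_loss(preds, targets, counts))
```

Under this assumed weighting, a sample with zero ratings receives zero weight, and the loss reduces to the ordinary mean Huber loss when all rating counts are equal and nonzero.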