We introduce Web-Scale Multimodal Summarization, a lightweight framework for generating summaries by combining retrieved text and image data from web sources. Given a user-defined topic, the system performs parallel web, news, and image searches. Retrieved images are ranked with a fine-tuned CLIP model that measures their semantic alignment with the topic and the retrieved text. Optional BLIP captioning enables image-only summaries for stronger multimodal coherence. The pipeline supports adjustable fetch limits, semantic filtering, summary styling, and structured-output downloads. We expose the system via a Gradio-based API with controllable parameters and preconfigured presets. Evaluation on 500 image-caption pairs with 20:1 contrastive negatives yields a ROC-AUC of 0.9270, an F1-score of 0.6504, and an accuracy of 96.99%, demonstrating strong multimodal alignment. This work provides a configurable, deployable tool for web-scale summarization that integrates language, retrieval, and vision models in a user-extensible pipeline.
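The ranking step described above can be sketched as cosine-similarity scoring between a topic embedding and precomputed image embeddings (as CLIP-style models produce), with images returned in descending order of alignment. This is a minimal illustration, not the paper's actual implementation; the function name and array shapes are assumptions.

```python
import numpy as np

def rank_by_alignment(topic_emb: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """Rank images by cosine similarity to a topic embedding.

    topic_emb:  shape (d,)   -- embedding of the user topic (plus retrieved text).
    image_embs: shape (n, d) -- one embedding per retrieved image.
    Returns image indices sorted from most to least semantically aligned.
    """
    # L2-normalize so the dot product equals cosine similarity.
    t = topic_emb / np.linalg.norm(topic_emb)
    im = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = im @ t                 # (n,) cosine similarities
    return np.argsort(-scores)      # descending order of alignment
```

In the actual system these embeddings would come from the fine-tuned CLIP text and image encoders; a semantic filter could then keep only images whose score exceeds a threshold.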