Large Language Models (LLMs) excel on general-purpose NLP benchmarks, yet their capabilities in specialized domains remain underexplored. In e-commerce, existing evaluations such as EcomInstruct, ChineseEcomQA, eCeLLM, and Shopping MMLU suffer from limited task diversity (e.g., lacking product guidance and after-sales issues), limited task modalities (e.g., absence of multimodal data), reliance on synthetic or curated data, and a narrow focus on English and Chinese, leaving practitioners without reliable tools for assessing models on complex, real-world shopping scenarios. We introduce EcomEval, a comprehensive multilingual and multimodal benchmark for evaluating LLMs in e-commerce. EcomEval covers six categories and 37 tasks (including 8 multimodal tasks), sourced primarily from authentic customer queries and transaction logs, and thus reflects the noisy, heterogeneous nature of real business interactions. To ensure both the quality and the scalability of reference answers, we adopt a semi-automatic pipeline in which large models draft candidate responses that are then reviewed and revised by over 50 expert annotators with strong e-commerce and multilingual backgrounds. We define difficulty levels for each question and task category by averaging evaluation scores across models of different sizes and capabilities, enabling challenge-oriented and fine-grained assessment. EcomEval further spans seven languages, including five low-resource Southeast Asian languages, offering a multilingual perspective absent from prior work.
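The difficulty-assignment step described above (averaging each question's evaluation scores across models of different sizes, then mapping the average to a difficulty level) can be sketched as follows. This is a minimal illustration of the idea only; the function name, score scale, and thresholds are hypothetical and not taken from EcomEval.

```python
# Hypothetical sketch: assign a difficulty level to a question by averaging
# its evaluation scores (assumed to lie in [0, 1]) across several models of
# different sizes and capabilities. Thresholds below are illustrative.

def difficulty_level(scores_by_model: dict[str, float]) -> str:
    """Map per-model scores for one question to a coarse difficulty label."""
    mean_score = sum(scores_by_model.values()) / len(scores_by_model)
    # A lower average score across models indicates a harder question.
    if mean_score >= 0.75:
        return "easy"
    if mean_score >= 0.40:
        return "medium"
    return "hard"

# Example: one question scored by three models of varying capability.
scores = {"small-7b": 0.2, "mid-32b": 0.5, "large-frontier": 0.8}
print(difficulty_level(scores))  # mean 0.5 -> "medium"
```

The same averaging can be applied at the task-category level by pooling scores over all questions in a category before thresholding.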