Large Language Models (LLMs) excel on general-purpose NLP benchmarks, yet their capabilities in specialized domains remain underexplored. In e-commerce, existing evaluations such as EcomInstruct, ChineseEcomQA, eCeLLM, and Shopping MMLU suffer from limited task diversity (e.g., lacking product guidance and after-sales issues), limited task modalities (e.g., absence of multimodal data), reliance on synthetic or curated data, and a narrow focus on English and Chinese, leaving practitioners without reliable tools for assessing models on complex, real-world shopping scenarios. We introduce EcomEval, a comprehensive multilingual and multimodal benchmark for evaluating LLMs in e-commerce. EcomEval covers six categories and 37 tasks (including 8 multimodal tasks), sourced primarily from authentic customer queries and transaction logs, and thus reflects the noisy, heterogeneous nature of real business interactions. To ensure both the quality and the scalability of reference answers, we adopt a semi-automatic pipeline in which large models draft candidate responses that are then reviewed and revised by over 50 expert annotators with strong e-commerce and multilingual backgrounds. We define difficulty levels for each question and task category by averaging evaluation scores across models of different sizes and capabilities, enabling challenge-oriented and fine-grained assessment. EcomEval further spans seven languages, including five low-resource Southeast Asian languages, offering a multilingual perspective absent from prior work.
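The difficulty-assignment step described above (averaging each question's evaluation scores across models of different sizes, then mapping the average to a difficulty level) can be sketched as follows. This is a minimal illustration of the idea only; the function name, score scale, and thresholds are hypothetical and not taken from EcomEval.

```python
# Hypothetical sketch: assign a difficulty level to a question by averaging
# its evaluation scores (assumed to lie in [0, 1]) across several models of
# different sizes and capabilities. Thresholds below are illustrative.

def difficulty_level(scores_by_model: dict[str, float]) -> str:
    """Map per-model scores for one question to a coarse difficulty label."""
    mean_score = sum(scores_by_model.values()) / len(scores_by_model)
    # A lower average score across models indicates a harder question.
    if mean_score >= 0.75:
        return "easy"
    if mean_score >= 0.40:
        return "medium"
    return "hard"

# Example: one question scored by three models of varying capability.
scores = {"small-7b": 0.2, "mid-32b": 0.5, "large-frontier": 0.8}
print(difficulty_level(scores))  # mean 0.5 -> "medium"
```

The same averaging can be applied at the task-category level by pooling scores over all questions in a category before thresholding.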