LLM-Ensemble: Optimal Large Language Model Ensemble Method for E-commerce Product Attribute Value Extraction

Product attribute value extraction is a pivotal task in Natural Language Processing (NLP) and the contemporary e-commerce industry. Precise product attribute values are fundamental to high-quality recommendations and customer satisfaction. Recently emerging Large Language Models (LLMs) have demonstrated state-of-the-art performance on numerous attribute extraction tasks without requiring domain-specific training data. Nevertheless, different LLMs exhibit different strengths and weaknesses owing to differences in their data, architectures, and hyperparameters. This variation makes them complementary, with no single LLM dominating all others, and it motivates an ensemble method that exploits their complementary potential. In this paper, we propose a novel algorithm called LLM-ensemble that ensembles the outputs of different LLMs for attribute value extraction. We iteratively learn a weight for each LLM and aggregate the LLMs' labels using these weights to predict the final attribute value. Our proposed method can be proven theoretically optimal, and it also ensures efficient computation, fast convergence, and safe deployment. We have conducted extensive experiments with various state-of-the-art LLMs, including Llama2-13B, Llama2-70B, PaLM-2, GPT-3.5, and GPT-4, on Walmart's internal data. Our offline metrics demonstrate that the LLM-ensemble method outperforms every one of these single LLMs on this dataset. The method has been launched in several production models, leading to improved Gross Merchandise Volume (GMV), Click-Through Rate (CTR), Conversion Rate (CVR), and Add-to-Cart Rate (ATC).
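To make the iterative weight learning mentioned in the abstract more concrete, the following is a minimal sketch of one way such a scheme could look: alternate between a weighted vote over the LLMs' labels and a re-estimation of each LLM's weight from its agreement with the current vote. The function name `iterative_weighted_vote`, the parameter names, and in particular the weight rule `num_classes * accuracy - 1` (borrowed from weighted majority voting in the crowdsourcing literature) are assumptions for illustration; the abstract does not specify the update rule, so this should not be read as the authors' exact algorithm.

```python
from collections import defaultdict


def iterative_weighted_vote(labels, num_classes, max_iters=50):
    """Aggregate per-model labels by iteratively re-weighted voting.

    labels: list of items; each item is a list holding one label per model.
    num_classes: number of candidate attribute values (assumed known).
    Returns (consensus labels, learned model weights).
    """
    num_models = len(labels[0])
    weights = [1.0] * num_models  # start from plain majority voting
    consensus = None

    for _ in range(max_iters):
        # Aggregation step: weighted vote over the models' labels per item.
        new_consensus = []
        for item in labels:
            scores = defaultdict(float)
            for model, label in enumerate(item):
                scores[label] += weights[model]
            # Sorting the keys first makes tie-breaking deterministic.
            new_consensus.append(max(sorted(scores), key=lambda k: scores[k]))

        if new_consensus == consensus:
            break  # consensus labels stopped changing
        consensus = new_consensus

        # Weight step (assumed rule): score each model by how often it
        # agrees with the current consensus, then rescale.
        for model in range(num_models):
            agree = sum(item[model] == c for item, c in zip(labels, consensus))
            accuracy = agree / len(labels)
            weights[model] = num_classes * accuracy - 1

    return consensus, weights


# Hypothetical toy data: three models labelling the colour of five products.
toy_labels = [
    ["red", "red", "blue"],
    ["blue", "blue", "blue"],
    ["green", "green", "red"],
    ["red", "red", "red"],
    ["blue", "blue", "green"],
]
consensus, weights = iterative_weighted_vote(toy_labels, num_classes=3)
print(consensus)
print(weights)
```

In this toy run the third model disagrees with the other two on three of five items, so it ends up with a much smaller weight than the first two. With the assumed rule, a model whose agreement rate falls below `1 / num_classes` receives a negative weight, meaning its votes count against the labels it picks.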
