To facilitate features such as faceted product search and product comparison, e-commerce platforms require accurately structured product data, including precise attribute/value pairs. Vendors often provide unstructured product descriptions consisting only of an offer title and a textual description. Consequently, extracting attribute values from titles and descriptions is vital for e-commerce platforms. State-of-the-art attribute value extraction (AVE) methods based on pre-trained language models (PLMs), such as BERT, face two drawbacks: (i) they require significant amounts of task-specific training data, and (ii) the fine-tuned models have problems generalising to unseen attribute values that were not part of the training data. This paper explores the potential of using large language models (LLMs) as a more training data-efficient and more robust alternative to existing AVE methods. We propose prompt templates for describing the target attributes of the extraction to the LLM, covering both zero-shot and few-shot scenarios. In the zero-shot scenario, textual and JSON-based target schema representations of the attributes are compared. In the few-shot scenario, we investigate (i) the provision of example attribute values, (ii) the selection of in-context demonstrations, (iii) shuffled ensembling to prevent position bias, and (iv) fine-tuning the LLM. We evaluate the prompt templates in combination with hosted LLMs, such as GPT-3.5 and GPT-4, and open-source LLMs that can be run locally, and compare the performance of the LLMs to the PLM-based methods SU-OpenTag, AVEQA, and MAVEQA. The highest average F1-score of 86% was achieved by GPT-4. Llama-3-70B performs only 3% worse than GPT-4, making it a competitive open-source alternative. Given the same training data, the best prompt/GPT-4 combination outperforms the best PLM baseline by an average of 6% F1-score.
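To make the zero-shot setting concrete, the following is a minimal sketch of how a JSON-based target schema could be embedded in an extraction prompt and how a model reply could be parsed. The schema contents, prompt wording, and helper names are illustrative assumptions, not the paper's actual templates.

```python
import json

# Hypothetical target schema: attribute names mapped to expected value types,
# serialized as JSON and embedded in the prompt (one zero-shot variant).
TARGET_SCHEMA = {"Brand": "string", "Color": "string", "Capacity": "string"}

def build_zero_shot_prompt(title: str, description: str) -> str:
    """Assemble a zero-shot extraction prompt around a JSON target schema."""
    schema_json = json.dumps(TARGET_SCHEMA, indent=2)
    return (
        "Extract the following attributes from the product offer.\n"
        "Return a JSON object matching this schema; use null for "
        "attributes that are not mentioned.\n\n"
        f"Schema:\n{schema_json}\n\n"
        f"Title: {title}\n"
        f"Description: {description}\n"
    )

def parse_response(raw: str) -> dict:
    """Parse the model's JSON reply, dropping attributes it marked null."""
    values = json.loads(raw)
    return {k: v for k, v in values.items() if v is not None}

prompt = build_zero_shot_prompt(
    "SanDisk Ultra 64GB microSD Card, Black",
    "High-speed memory card for cameras and phones.",
)
# A well-formed model reply might look like:
reply = '{"Brand": "SanDisk", "Color": "Black", "Capacity": "64GB"}'
extracted = parse_response(reply)
```

In the few-shot variants described above, the same prompt would additionally carry in-context demonstrations (offer/answer pairs), whose ordering the shuffled-ensembling step varies to counter position bias.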