GenToC: Leveraging Partially-Labeled Data for Product Attribute-Value Identification

In e-commerce, accurately extracting attribute-value pairs from product listings (e.g., Brand: Apple) is crucial for improving search and recommendation systems. Automating this extraction is challenging because of the vast diversity of product categories and their attributes, the lack of large, accurately annotated training datasets, and the low latency that real-time e-commerce platforms demand. To address these challenges, we introduce GenToC, a novel two-stage model for extracting attribute-value pairs from product titles. GenToC is designed to train on partially-labeled data: it learns from incomplete sets of attribute-value pairs, which removes the need for a fully annotated dataset. We also introduce a bootstrapping method that lets GenToC progressively refine and expand its training dataset. The enriched dataset substantially improves the data available for training other neural network models, which are typically faster than GenToC but less able to handle partially-labeled data. Trained on this enriched data, these alternative models perform significantly better, making them more suitable for real-time deployment. Our results highlight GenToC's ability to learn from a limited set of labeled data and to support the training of more efficient models, a significant advance in the automated extraction of attribute-value pairs from product titles. GenToC has been successfully integrated into India's largest B2B e-commerce platform, IndiaMART.com, where it increases recall by 21.1% over the existing deployed system while maintaining a high precision of 89.5% on this challenging task.
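
The abstract does not spell out how the bootstrapping step works, so the following is only a minimal sketch of one plausible reading: confident predictions are merged into the partial labels, and existing annotations are never overwritten. The function names, the `threshold` parameter, the confidence scores, and the `toy_predict` stand-in are all assumptions made for illustration; none of them come from the GenToC paper.

```python
from typing import Callable, Dict, List, Tuple

Labels = Dict[str, str]               # attribute -> value, possibly incomplete
Prediction = Tuple[str, str, float]   # (attribute, value, confidence)


def bootstrap_labels(
    titles: List[str],
    partial_labels: List[Labels],
    predict: Callable[[str], List[Prediction]],
    threshold: float = 0.9,           # hypothetical cut-off, not from the paper
) -> List[Labels]:
    """Return new label dicts enriched with confident predictions."""
    enriched = []
    for title, labels in zip(titles, partial_labels):
        merged = dict(labels)         # copy, so the input labels stay untouched
        for attribute, value, confidence in predict(title):
            # Add only attributes that are missing and confidently predicted.
            if confidence >= threshold and attribute not in merged:
                merged[attribute] = value
        enriched.append(merged)
    return enriched


def toy_predict(title: str) -> List[Prediction]:
    """Hypothetical stand-in for a trained extractor (lookup table only)."""
    known = {"Apple": ("Brand", 0.97), "Red": ("Color", 0.95), "Pro": ("Model", 0.40)}
    predictions = []
    for token in title.split():
        if token in known:
            attribute, confidence = known[token]
            predictions.append((attribute, token, confidence))
    return predictions


if __name__ == "__main__":
    titles = ["Apple iPhone Pro Red"]
    partial = [{"Brand": "Apple"}]    # only one attribute was annotated
    print(bootstrap_labels(titles, partial, toy_predict))
```

In this toy run the missing `Color` attribute is filled in, while the low-confidence `Model` prediction is discarded. A full bootstrapping loop would presumably retrain the extractor on the enriched labels and repeat, and the enriched labels could then be used to train a faster model for deployment.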
