In e-commerce, accurately extracting attribute-value pairs from product listings (e.g., Brand: Apple) is crucial for improving search and recommendation systems. Automating this extraction is challenging because of the vast diversity of product categories and their attributes, the lack of large, accurately annotated training datasets, and the low latency that real-time e-commerce platforms demand. To address these challenges, we introduce GenToC, a novel two-stage model for extracting attribute-value pairs from product titles. GenToC is designed to train on partially labeled data: it learns from incomplete sets of attribute-value pairs and so does not require a fully annotated dataset. We also introduce a bootstrapping method that lets GenToC progressively refine and expand its training dataset. The refined dataset substantially improves the data available for training other neural network models, which are typically faster than GenToC but less able to handle partially labeled data. Trained on this enriched dataset, these alternative models perform significantly better, making them more suitable for real-time deployment. Our results highlight GenToC's ability to learn from a limited set of labeled data and to support the training of more efficient models, a significant step forward in automated attribute-value extraction from product titles. GenToC has been successfully integrated into IndiaMART.com, India's largest B2B e-commerce platform, where it increases recall by 21.1% over the existing deployed system while maintaining a high precision of 89.5% on this challenging task.
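The bootstrapping idea described in the abstract can be illustrated with a minimal sketch. This is not GenToC itself: the abstract does not specify the model's internals, so a trivial dictionary-lookup "model" stands in for it, and every name here (`bootstrap`, `train_lookup`, the `threshold` parameter, the sample titles) is a hypothetical chosen for illustration. The sketch only shows the general loop: train on partial labels, predict on the same titles, and add confident new attribute-value pairs to the training labels.

```python
from typing import Callable, Dict, List, Tuple

Labels = Dict[str, str]                # attribute -> value for one title
Prediction = Tuple[str, str, float]    # (attribute, value, confidence)
Predictor = Callable[[str], List[Prediction]]


def train_lookup(titles: List[str], labels: List[Labels]) -> Predictor:
    """Hypothetical stand-in model: memorize which attribute each value belongs to."""
    value_to_attr: Dict[str, str] = {}
    for known in labels:
        for attr, value in known.items():
            value_to_attr[value.lower()] = attr

    def predict(title: str) -> List[Prediction]:
        # Tag every token that was seen as a value during training.
        return [
            (value_to_attr[tok.lower()], tok, 1.0)
            for tok in title.split()
            if tok.lower() in value_to_attr
        ]

    return predict


def bootstrap(
    titles: List[str],
    labels: List[Labels],
    train: Callable[[List[str], List[Labels]], Predictor],
    rounds: int = 2,
    threshold: float = 0.9,
) -> List[Labels]:
    """Repeatedly retrain and fill in missing attributes with confident predictions."""
    enriched = [dict(known) for known in labels]  # copy; leave the input untouched
    for _ in range(rounds):
        predict = train(titles, enriched)
        for title, known in zip(titles, enriched):
            for attr, value, conf in predict(title):
                # Only add attributes that are missing; never overwrite given labels.
                if attr not in known and conf >= threshold:
                    known[attr] = value
    return enriched


titles = ["Apple iPhone 13 Blue", "Samsung Galaxy Blue", "Apple Watch Black"]
labels = [
    {"Brand": "Apple", "Color": "Blue"},  # fully labeled
    {"Brand": "Samsung"},                 # Color missing
    {"Color": "Black"},                   # Brand missing
]
enriched = bootstrap(titles, labels, train_lookup)
```

In this toy run, the second title gains a `Color` label and the third gains a `Brand` label, while the original `labels` list is left unchanged. The enriched labels could then be used to train a faster model that expects more complete annotations, which is the role the abstract assigns to the bootstrapped dataset.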