In e-commerce, accurately extracting attribute-value pairs (e.g., Brand: Apple) from product titles and user search queries is crucial for improving search and recommendation systems. A major challenge for neural models on this task is the lack of high-quality training data, because the attribute-value annotations in available datasets are often incomplete. To address this, we introduce GenToC, a model designed to train directly on partially labeled data, eliminating the need for a fully annotated dataset. GenToC employs a marker-augmented generative model to identify potential attributes, followed by a token classification model that determines the values associated with each attribute. GenToC outperforms existing state-of-the-art models, with an increase of up to 56.3% in the number of accurate extractions. Furthermore, we use GenToC to regenerate the training dataset and expand its attribute-value annotations. This bootstrapping substantially improves the quality of the data used to train other standard NER models, which are typically faster but less capable of handling partially labeled data, enabling them to achieve performance comparable to GenToC. Our results demonstrate GenToC's unique ability to learn from a limited set of partially labeled data and to improve the training of more efficient models, advancing the automated extraction of attribute-value pairs. Finally, our model has been successfully integrated into IndiaMART, India's largest B2B e-commerce platform, where it increases the number of correctly identified attribute-value pairs by 20.2% over the existing deployed system while maintaining a high precision of 89.5%.
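To make the two-stage idea concrete, the following is a minimal Python sketch of how an attribute generator and a token classifier could be chained, and how their predictions might be used to expand partial annotations. It is not the authors' implementation: the function names (`extract_pairs`, `bootstrap_annotations`), the whitespace tokenization, the 0/1 token labels, and the rule-based `toy_generator` and `toy_tagger` stand-ins are all assumptions made for illustration, and the real stages are neural models. The policy of keeping existing labels when merging is likewise an assumption.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces (assumptions, not from the paper):
# stage 1 maps a text to candidate attribute names,
# stage 2 maps (text, attribute) to one 0/1 label per whitespace token.
AttributeGenerator = Callable[[str], List[str]]
ValueTagger = Callable[[str, str], List[int]]


def extract_pairs(text: str,
                  generate_attributes: AttributeGenerator,
                  tag_values: ValueTagger) -> Dict[str, str]:
    """Chain the two stages: propose attributes, then tag value tokens for each."""
    tokens = text.split()
    pairs: Dict[str, str] = {}
    for attribute in generate_attributes(text):
        labels = tag_values(text, attribute)
        value_tokens = [tok for tok, lab in zip(tokens, labels) if lab == 1]
        if value_tokens:  # skip attributes for which no value token was tagged
            pairs[attribute] = " ".join(value_tokens)
    return pairs


def bootstrap_annotations(text: str,
                          existing: Dict[str, str],
                          generate_attributes: AttributeGenerator,
                          tag_values: ValueTagger) -> Dict[str, str]:
    """Expand partial annotations with predicted pairs.

    Assumption: existing labels take precedence over predictions.
    """
    merged = extract_pairs(text, generate_attributes, tag_values)
    merged.update(existing)
    return merged


# Toy rule-based stand-ins so the sketch runs without any trained model.
def toy_generator(text: str) -> List[str]:
    tokens = text.lower().split()
    attributes = []
    if "apple" in tokens:
        attributes.append("Brand")
    if any(tok.endswith("gb") for tok in tokens):
        attributes.append("Storage")
    return attributes


def toy_tagger(text: str, attribute: str) -> List[int]:
    tokens = text.lower().split()
    if attribute == "Brand":
        return [1 if tok == "apple" else 0 for tok in tokens]
    if attribute == "Storage":
        return [1 if tok.endswith("gb") else 0 for tok in tokens]
    return [0] * len(tokens)


if __name__ == "__main__":
    title = "Apple iPhone 13 128GB"
    print(extract_pairs(title, toy_generator, toy_tagger))
    print(bootstrap_annotations(title, {"Brand": "Apple"}, toy_generator, toy_tagger))
```

In this sketch, a title annotated only with Brand gains a Storage annotation after bootstrapping, which is the kind of expanded label set a faster NER model could then be trained on.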