Jewelry recognition is a complex task because accessories come in a wide variety of styles and designs. Today, precise descriptions of the various accessories can only be produced by experts in the field of jewelry. In this work, we propose an approach to jewelry recognition that uses computer vision techniques and image captioning to emulate how a human expert analyzes accessories. The proposed methodology consists of using different image captioning models to detect the jewels in an image and generate a natural language description of the accessory. This description is then also used to classify the accessories at different levels of detail. The generated caption includes details such as the type of jewel, color, material, and design. To demonstrate the effectiveness of the proposed method in accurately recognizing different types of jewels, we created a dataset of images of accessories from jewelry stores in C\'ordoba (Spain). After testing the different image captioning architectures designed, the final model achieves a captioning accuracy of 95\%. The proposed methodology has the potential to be used in various applications such as jewelry e-commerce, inventory management, or automatic jewelry recognition to analyze people's tastes and social status.