Although recipe data are easy to come by nowadays, it is hard to find a complete recipe dataset, one with a list of ingredients, nutrient values per ingredient and per recipe, allergens, and so on. Recipe datasets are usually collected from social media websites where users post and publish recipes, which are typically written with little to no structure and use both standardized and non-standardized units of measurement. We collect six publicly available recipe datasets, which come in different formats and some of which include data in different languages. Bringing all of these datasets into the format needed to apply a machine learning (ML) pipeline for nutrient prediction [1], [2] involves data normalization using dictionary-based named entity recognition (NER), rule-based NER, and conversions using external domain-specific resources. From the list of ingredients, domain-specific embeddings are created using the same embedding space for all recipes, so that a single ingredient dataset is generated. The result of this normalization process is two corpora: one with predefined ingredient embeddings and one with predefined recipe embeddings. The ML pipeline is evaluated on all six recipe datasets. The results from this use case also confirm that the embeddings merged using the domain heuristic yield better results than the baselines.
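The abstract does not spell out the normalization rules, so the following is only a minimal sketch of what a rule-based step for quantities and units might look like. The regular expression, the conversion table `UNIT_TO_GRAMS`, and the function name `normalize_ingredient` are all illustrative assumptions, not the authors' implementation. The sketch handles mass units only. Converting volume units such as cups would require density data from an external resource, which is omitted here.

```python
import re
from fractions import Fraction

# Hypothetical conversion table (mass units only); a real pipeline would
# draw on external domain-specific resources instead.
UNIT_TO_GRAMS = {
    "g": 1.0, "gram": 1.0, "grams": 1.0,
    "kg": 1000.0,
    "oz": 28.3495, "ounce": 28.3495, "ounces": 28.3495,
    "lb": 453.592, "pound": 453.592, "pounds": 453.592,
}

# Quantity (mixed number, fraction, or decimal), optional unit token, then the name.
LINE_RE = re.compile(
    r"^\s*(?P<qty>\d+\s+\d+/\d+|\d+/\d+|\d+(?:\.\d+)?)"
    r"\s*(?P<unit>[a-zA-Z]+)?\s+(?P<name>.+?)\s*$"
)


def normalize_ingredient(line):
    """Split a free-text ingredient line into quantity, unit, name, and grams."""
    match = LINE_RE.match(line)
    if match is None:
        # No leading quantity: keep the text as the name, leave the rest unknown.
        return {"quantity": None, "unit": None,
                "name": line.strip().lower(), "grams": None}

    # "1 1/2" is summed as 1 + 1/2; Fraction also parses decimals like "1.5".
    quantity = float(sum(Fraction(part) for part in match.group("qty").split()))
    unit = match.group("unit")
    name = match.group("name").lower()

    if unit is not None and unit.lower() in UNIT_TO_GRAMS:
        unit = unit.lower()
        grams = quantity * UNIT_TO_GRAMS[unit]
    else:
        # Non-standardized or missing unit: the token belongs to the name.
        if unit is not None:
            name = f"{unit.lower()} {name}"
        unit, grams = None, None

    return {"quantity": quantity, "unit": unit, "name": name, "grams": grams}


if __name__ == "__main__":
    for text in ["200g sugar", "1 1/2 lb flour", "2 large eggs", "a pinch of salt"]:
        print(text, "->", normalize_ingredient(text))
```

Lines whose unit is missing from the table come back with `grams` set to `None` rather than a guessed value, so a later step can route them to a dictionary lookup or manual review.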