Identifying the full landscape of small and medium-sized enterprises (SMEs) in specialized industry sectors is critical for supply-chain resilience, yet existing business databases have substantial coverage gaps, particularly for sub-tier suppliers and firms in emerging niche markets. We propose a \textbf{Web--Knowledge--Web (W$\to$K$\to$W)} pipeline that iteratively (1)~crawls domain-specific web sources to discover candidate supplier entities, (2)~extracts and consolidates structured knowledge into a heterogeneous knowledge graph, and (3)~uses the graph's topology and coverage signals to steer subsequent crawling toward under-represented regions of the supplier space. To quantify discovery completeness, we introduce a \textbf{coverage estimation framework} inspired by ecological species-richness estimators (Chao1, ACE) and adapted to web-entity populations. Experiments on the semiconductor equipment manufacturing sector (NAICS 333242) show that, among all methods given the same 213-page crawl budget, the W$\to$K$\to$W pipeline achieves the highest precision (0.138) and F1 (0.118). It builds a knowledge graph of 765 entities and 586 relations and reaches peak recall by iteration~3, after crawling only 112 pages.
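
For orientation, the classical Chao1 estimator on which the coverage framework draws can be written as below. This is a sketch under stated assumptions rather than the exact adaptation used here: we assume $S_{\mathrm{obs}}$ is the number of distinct supplier entities discovered so far, $f_1$ and $f_2$ are the numbers of entities observed on exactly one and exactly two crawled pages (with $f_2 > 0$), and the completeness ratio $\hat{C}$ is a hypothetical summary introduced only for illustration.
\[
  \hat{S}_{\mathrm{Chao1}} = S_{\mathrm{obs}} + \frac{f_1^{2}}{2 f_2},
  \qquad
  \hat{C} = \frac{S_{\mathrm{obs}}}{\hat{S}_{\mathrm{Chao1}}}
\]
Under these assumptions, many singletons relative to doubletons indicate a large unseen population and hence low estimated completeness, which is the kind of signal that could direct further crawling.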