Coverage-Aware Web Crawling for Domain-Specific Supplier Discovery via a Web--Knowledge--Web Pipeline



Yijiashun Qi, Yijiazhen Qi, Tanmay Wagh

Identifying the full landscape of small and medium-sized enterprises (SMEs) in specialized industry sectors is critical for supply-chain resilience, yet existing business databases suffer from substantial coverage gaps, particularly for sub-tier suppliers and firms in emerging niche markets. We propose a Web--Knowledge--Web (W→K→W) pipeline that iteratively (1) crawls domain-specific web sources to discover candidate supplier entities, (2) extracts and consolidates structured knowledge into a heterogeneous knowledge graph, and (3) uses the knowledge graph's topology and coverage signals to guide subsequent crawling toward under-represented regions of the supplier space. To quantify discovery completeness, we introduce a coverage estimation framework inspired by ecological species-richness estimators (Chao1, ACE) and adapted for web-entity populations. Experiments on the semiconductor equipment manufacturing sector (NAICS 333242) demonstrate that the W→K→W pipeline achieves the highest precision (0.138) and F1 (0.118) among all methods using the same 213-page crawl budget. It builds a knowledge graph of 765 entities and 586 relations and reaches peak recall by iteration 3, after only 112 pages.
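
To make the coverage idea concrete, here is a minimal sketch of the standard bias-corrected Chao1 richness estimator applied to entity sightings. The abstract does not say how the authors adapt Chao1 or ACE to web entities, so this is an illustration under stated assumptions, not their method. Each element of the input list is assumed to be one page-level sighting of a supplier, and the function names (`chao1`, `estimated_coverage`) and the supplier labels are hypothetical.

```python
from collections import Counter


def chao1(sightings):
    """Bias-corrected Chao1 estimate of the total number of distinct entities.

    `sightings` is a list with one entry per observation of an entity
    (assumption: one entry per page on which a supplier was found).
    """
    counts = Counter(sightings)  # entity -> number of sightings
    s_obs = len(counts)          # distinct entities observed so far
    f1 = sum(1 for c in counts.values() if c == 1)  # seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)  # seen exactly twice
    # Standard bias-corrected form; well defined even when f2 == 0.
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def estimated_coverage(sightings):
    """Fraction of the estimated entity population already discovered."""
    s_obs = len(set(sightings))
    if s_obs == 0:
        return 0.0
    return s_obs / chao1(sightings)


# Hypothetical sightings: A seen 3 times, B twice, C/D/E once each.
sightings = ["A", "A", "A", "B", "B", "C", "D", "E"]
print(chao1(sightings))               # 5 + 3*2 / (2*2) = 6.5
print(estimated_coverage(sightings))  # 5 / 6.5
```

Many singletons relative to doubletons push the estimate well above the observed count, which signals that further crawling is likely to surface new suppliers. A crawler could use a falling singleton share as one possible stopping or redirection signal.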

