The attack surface of a modern operating system is a haystack: thousands of signed binaries and millions of functions, almost none relevant to any given vulnerability. A human analyst or an LLM agent must pick the function worth reading before analyzing it. At whole-OS scope, this target selection, not the analysis, is the binding constraint. We present Symbolicate-Enrich-Sample, a low-cost batch pipeline that turns a corpus of production Windows binaries into a queryable, priority-ranked research queue. We (i) recover function-level symbols for stripped vendor binaries by auto-fetching the public symbol files and joining them to a recovered call graph; (ii) attach cheap, deterministic structural features to each named function and, conditioned on those features, use a low-cost language model to assign a reachability tier, a risk level, a bug-class hypothesis, and a rationale; and (iii) draw diverse, prioritized batches via a priority-weighted importance sampler. The contribution is a selection substrate: the prioritization layer a downstream detector or LLM agent runs on top of. Across a whole Windows image of 7,231,419 functions, the labels are markedly selective, and stacking deterministic filters on them leaves a ~22K-function shortlist: the candidate needles, few enough for a human or agent to work through. We characterize the pipeline's selectivity and its failure modes, describe the methodology, and report aggregate statistics; we withhold the derived dataset for legal and dual-use reasons.
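As a rough illustration of step (iii), the following is a minimal sketch of one way to draw a priority-weighted batch without replacement. The abstract does not specify the sampler, so everything here is an assumption: the `Candidate` fields, the `sample_batch` function, the use of Efraimidis-Spirakis random keys, and the per-binary cap standing in for the diversity requirement are all hypothetical choices, not the pipeline's actual design.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    binary: str      # hypothetical: module the function lives in
    name: str        # hypothetical: recovered symbol name
    priority: float  # hypothetical: positive score derived from the labels


def sample_batch(candidates, k, per_binary_cap=2, seed=0):
    """Draw up to k candidates without replacement, favoring high priority.

    Each candidate gets a random key u ** (1 / priority) with u uniform in
    [0, 1); taking candidates in descending key order is weighted sampling
    without replacement (Efraimidis-Spirakis). The per-binary cap is an
    assumed, simple stand-in for batch diversity.
    """
    rng = random.Random(seed)
    keyed = [
        (rng.random() ** (1.0 / c.priority), c)
        for c in candidates
        if c.priority > 0  # zero-priority functions are never drawn
    ]
    keyed.sort(key=lambda pair: pair[0], reverse=True)

    taken = Counter()
    batch = []
    for _, cand in keyed:
        if len(batch) >= k:
            break
        if taken[cand.binary] >= per_binary_cap:
            continue  # skip: this binary already fills its share of the batch
        taken[cand.binary] += 1
        batch.append(cand)
    return batch
```

Under these assumptions, higher-priority functions are more likely to appear in a batch, lower-priority ones still surface occasionally, and no single binary can dominate a batch. Fixing the seed makes a draw reproducible.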