The increasing prevalence of malicious Portable Document Format (PDF) files necessitates robust and comprehensive feature extraction techniques for effective detection and analysis. This work presents a unified framework that integrates graph-based, structural, and metadata-driven analysis to generate a rich feature representation for each PDF document. The system extracts text from PDF pages and constructs undirected graphs based on pairwise word relationships, enabling the computation of graph-theoretic features such as node count, edge density, and clustering coefficient. Simultaneously, the framework parses embedded metadata to quantify character distributions, entropy patterns, and inconsistencies across fields such as author, title, and producer. Temporal features are derived from creation and modification timestamps to capture behavioral signatures, while structural elements, including object streams, fonts, and embedded images, are quantified to reflect document complexity. Boolean flags for potentially malicious PDF constructs (e.g., JavaScript, launch actions) are also extracted. Together, these features form a 170-dimensional vector representation that is well suited to downstream tasks such as malware classification, anomaly detection, and forensic analysis. The proposed approach is scalable, extensible, and designed to support real-world PDF threat intelligence workflows.
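
As a rough illustration of two of the feature families named above, the following minimal sketch computes graph-theoretic features from extracted text and the Shannon entropy of a metadata string. It is not the framework's implementation: the function names (`word_graph`, `graph_features`, `shannon_entropy`) are hypothetical, and the sketch assumes that a "pairwise word relationship" means two words appearing consecutively in the text, which the abstract does not specify. PDF parsing itself is omitted; the input is assumed to be already-extracted text.

```python
import math
from collections import Counter
from itertools import combinations


def word_graph(text):
    """Build an undirected word graph as a dict of adjacency sets.

    Assumption: two words are related when they are adjacent in the
    token stream. The source does not define the relationship.
    """
    words = text.lower().split()
    adj = {w: set() for w in words}
    for a, b in zip(words, words[1:]):
        if a != b:  # ignore self-loops from repeated words
            adj[a].add(b)
            adj[b].add(a)
    return adj


def graph_features(adj):
    """Return node count, edge count, edge density, and average clustering."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2  # each edge counted twice
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0

    coeffs = []
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)  # clustering is taken as 0 for degree < 2
            continue
        # Count edges that exist among this node's neighbours.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        coeffs.append(2 * links / (k * (k - 1)))
    avg_clustering = sum(coeffs) / n if n else 0.0

    return {
        "node_count": n,
        "edge_count": m,
        "edge_density": density,
        "avg_clustering": avg_clustering,
    }


def shannon_entropy(value):
    """Character-level Shannon entropy (bits) of a metadata field value."""
    if not value:
        return 0.0
    total = len(value)
    counts = Counter(value)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

For example, `graph_features(word_graph(page_text))` yields a small dictionary that could be concatenated with `shannon_entropy(author_field)` and the other metadata, temporal, and structural values to form one slice of the feature vector; `page_text` and `author_field` are placeholder names here.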