Reverse engineering of protocol message formats is critical for many security applications. Mainstream techniques use dynamic analysis and inherit its low-coverage problem: the inferred message formats reflect only the features of the input messages that were analyzed. To achieve high coverage, we use static analysis to infer message formats from the implementation of protocol parsers. In this work, we focus on a class of extremely challenging protocols whose formats are described via constraint-enhanced regular expressions and parsed using finite-state machines. Such state machines are often implemented as complicated parsing loops, which are inherently difficult to analyze with conventional static analysis. Our new technique extracts a state machine by regarding each loop iteration as a state and the dependencies between loop iterations as state transitions. To achieve high (i.e., path-sensitive) precision while avoiding path explosion, the analysis merges as many paths as possible according to carefully designed rules. The evaluation results show that we can infer a state machine, and thus the message formats, within five minutes and with over 90% precision and recall, far better than the state of the art. We also applied the inferred state machines to enhance protocol fuzzers, which improve by 20% to 230% in terms of coverage and detect ten more zero-days than the baselines.
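To make the "loop iteration as state, dependency between iterations as transition" view concrete, here is a minimal sketch of the kind of parsing loop the abstract has in mind. It is not the technique described above: the paper infers the state machine statically from the parser's code, whereas this toy only records the transitions taken while running. The message format (`[A-Z]+ [0-9]+\n`), the state names, and the functions `classify` and `parse` are all hypothetical and chosen purely for illustration.

```python
def classify(byte: int) -> str:
    """Map a byte to a coarse character class (hypothetical classes)."""
    if 65 <= byte <= 90:
        return "upper"
    if 48 <= byte <= 57:
        return "digit"
    if byte == 32:
        return "space"
    if byte == 10:
        return "newline"
    return "other"


def parse(msg: bytes):
    """Toy parsing loop for the assumed format [A-Z]+ [0-9]+\\n.

    The variable `state` carries information from one loop iteration to
    the next; each (state, class, next_state) triple is one transition.
    Returns (accepted, set_of_transitions_taken).
    """
    state = "START"
    transitions = set()
    for byte in msg:
        cls = classify(byte)
        if state in ("START", "CMD") and cls == "upper":
            nxt = "CMD"
        elif state == "CMD" and cls == "space":
            nxt = "SEP"
        elif state in ("SEP", "NUM") and cls == "digit":
            nxt = "NUM"
        elif state == "NUM" and cls == "newline":
            nxt = "DONE"
        else:
            # Any other combination, including bytes after DONE, is rejected.
            nxt = "ERROR"
        transitions.add((state, cls, nxt))
        state = nxt
        if state == "ERROR":
            break
    return state == "DONE", transitions


if __name__ == "__main__":
    accepted, edges = parse(b"GET 42\n")
    print(accepted)
    for edge in sorted(edges):
        print(edge)
```

Running the loop on a single message only reveals the transitions that message happens to exercise, which is the low-coverage limitation of dynamic analysis that the abstract points out; a static analysis of the loop body would aim to recover every branch of the `if`/`elif` chain regardless of input.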