Many services today continuously produce large volumes of log files in different and changing formats. These logs are important because they record application activity; analyzing them is necessary to improve the application and to maintain the security and stability of the system. Log files are commonly stored in compressed form to reduce their size. A compression algorithm identifies frequent patterns in a log file in order to remove redundant information. This work presents an approach that detects frequent patterns in textual data by registering them during the file compression process itself, with low resource consumption. The log file can then be visualized, and the extracted patterns explored using metrics based on properties such as the frequency, length, and root prefixes of each pattern. This allows an analyst to gain relevant insights more efficiently and reduces the need for labor-intensive manual inspection of the log data. Because the approach extends a dictionary-based compression algorithm, it recognizes patterns in log files of any format and eliminates the need to manually prepare or preprocess the log files.
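To make the idea of registering patterns during compression concrete, the following is a minimal sketch, not the implementation described in this work. It assumes an LZW-style coder as a stand-in for the dictionary-based algorithm, which the abstract does not name, and it seeds the dictionary with the characters of the input rather than a fixed byte alphabet. The names `lzw_compress_with_patterns` and `top_patterns` are hypothetical, and only the frequency and length metrics are modeled; root prefixes are left out.

```python
from collections import Counter


def lzw_compress_with_patterns(text):
    """Compress `text` LZW-style and count how often each phrase is emitted.

    Assumption: the initial dictionary holds the distinct characters of the
    input (a simplification of the usual fixed 256-entry byte alphabet).
    """
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    counts = Counter()   # phrase -> number of times it was emitted as a code
    codes = []           # the compressed output
    current = ""
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            # Keep extending the match while the dictionary knows the phrase.
            current = candidate
        else:
            # Emit the longest known phrase and register it as a pattern hit.
            codes.append(dictionary[current])
            counts[current] += 1
            # Learn the new, longer phrase for later matches.
            dictionary[candidate] = len(dictionary)
            current = ch
    if current:
        codes.append(dictionary[current])
        counts[current] += 1
    return codes, counts


def top_patterns(counts, min_length=2, limit=5):
    """Rank patterns by frequency, then by length (hypothetical metric)."""
    candidates = [(p, c) for p, c in counts.items() if len(p) >= min_length]
    return sorted(candidates, key=lambda pc: (-pc[1], -len(pc[0]), pc[0]))[:limit]


if __name__ == "__main__":
    log = (
        "GET /index.html 200\n"
        "GET /about.html 200\n"
        "GET /index.html 404\n"
    ) * 20
    codes, counts = lzw_compress_with_patterns(log)
    print(f"{len(log)} characters compressed to {len(codes)} codes")
    for pattern, count in top_patterns(counts, min_length=4):
        print(repr(pattern), count)
```

The pattern counts fall out of work the coder already performs: every emitted code corresponds to one dictionary phrase, so the only extra cost is a counter update per code. No knowledge of the log format is required, since the dictionary is built from the raw character stream.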