Clid: Identifying TLS Clients With Unsupervised Learning on Domain Names

In this paper, we introduce Clid, a Transport Layer Security (TLS) client identification tool based on unsupervised learning on domain names in the server name indication (SNI) field. Clid aims to provide some information about a wide range of clients, even if it cannot identify a definitive characteristic of each one. This differs from many existing rule-based client identification tools, which rely on hardcoded databases to identify granular characteristics of a few clients. As their databases grow outdated, these tools can often identify only a small number of clients in a real-world network, which motivates an alternative approach like Clid. For this research, we use approximately 345 million anonymized TLS handshakes collected from a large university campus network. From each handshake, we create a TCP fingerprint that identifies each unique client corresponding to a physical device on the network. Clid uses Bayesian optimization to find the 'optimal' DBSCAN clustering of clients and domain names for a set of TLS connections. Clid then maps each client cluster to the one or more domain clusters most strongly associated with it, based on the frequency and exclusivity of their TLS connections. While learning the domain names highly associated with a client may not immediately reveal specific characteristics such as its operating system, manufacturer, or TLS configuration, it may serve as a strong first step toward doing so. We evaluate Clid's performance on various subsets of our captured TLS handshakes and under different parameter settings that affect the granularity of the identification results. Clid identifies 'strongly associated' domain names for at least 60% of all clients in every one of our experiments.
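To make the mapping step more concrete, the following is a minimal sketch of how client clusters might be associated with domain clusters by frequency and exclusivity. It is an illustration under stated assumptions, not Clid's actual implementation: the function name `associate_clusters`, the definitions of frequency and exclusivity, the product score, and the `threshold` parameter are all hypothetical, and the cluster labels are supplied by hand rather than produced by DBSCAN.

```python
from collections import Counter, defaultdict


def associate_clusters(connections, client_cluster, domain_cluster, threshold=0.5):
    """Map each client cluster to the domain clusters it is strongly tied to.

    connections: iterable of (client_id, domain) pairs, one per TLS handshake.
    client_cluster / domain_cluster: dicts from id to cluster label.
    The scoring below is an assumed definition, not the one used by Clid.
    """
    pair_counts = Counter()
    client_totals = Counter()
    domain_totals = Counter()
    for client, domain in connections:
        cc = client_cluster[client]
        dc = domain_cluster[domain]
        pair_counts[(cc, dc)] += 1
        client_totals[cc] += 1
        domain_totals[dc] += 1

    mapping = defaultdict(list)
    for (cc, dc), n in pair_counts.items():
        # Frequency: share of the client cluster's connections going to dc.
        frequency = n / client_totals[cc]
        # Exclusivity: share of dc's connections that come from cc.
        exclusivity = n / domain_totals[dc]
        score = frequency * exclusivity
        if score >= threshold:
            mapping[cc].append((dc, score))

    # Strongest associations first within each client cluster.
    for cc in mapping:
        mapping[cc].sort(key=lambda item: item[1], reverse=True)
    return dict(mapping)


# Toy data: client cluster 0 mostly contacts cluster "A"; the shared "CDN"
# cluster is contacted by both client clusters and is therefore not exclusive.
connections = [
    ("c1", "a.example"), ("c1", "a.example"), ("c2", "a.example"),
    ("c2", "cdn.example"), ("c3", "b.example"), ("c3", "cdn.example"),
]
client_cluster = {"c1": 0, "c2": 0, "c3": 1}
domain_cluster = {"a.example": "A", "b.example": "B", "cdn.example": "CDN"}

result = associate_clusters(connections, client_cluster, domain_cluster)
print(result)
```

In this toy example the shared `CDN` cluster falls below the threshold for both client clusters, which reflects the intuition that a domain contacted by many kinds of clients says little about any one of them.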
