Detecting malware even when it is encrypted - Machine Learning for network HTTPS analysis
From BruCON 2017
With the increasing amount of malware HTTPS traffic, it is a challenge to discover new features and methods to detect malware without decrypting the traffic. A detection method that does not need to decrypt the traffic is cheaper (because no traffic interceptor is needed), faster, and preserves privacy, respecting the original idea of HTTPS. Our research goal is to detect malware HTTPS connections using only data from Bro IDS logs, which can be generated without decrypting the traffic.
We created and extracted our features from data logs that the Bro IDS is able to generate from a pcap file. Bro offers information about flows, SSL handshakes and X.509 certificates. These three types of data give us enough information to create powerful features and machine learning algorithms to detect the malicious HTTPS traffic with good accuracy.
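As a rough illustration of how such logs can be loaded for feature extraction, the sketch below reads a Bro log in its default tab-separated format, where a `#fields` header line names the columns. The function name `read_bro_log` and the sample record are hypothetical and are not taken from our tooling; the column names follow the standard conn.log layout.

```python
import io

def read_bro_log(lines):
    """Yield one dict per record from a Bro TSV log (conn.log, ssl.log, x509.log)."""
    fields = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            # The header line lists the column names after the "#fields" token.
            fields = line.split("\t")[1:]
        elif line and not line.startswith("#") and fields:
            # Data line: pair each value with its column name.
            yield dict(zip(fields, line.split("\t")))

# Hypothetical conn.log excerpt with a single flow (made-up addresses).
sample = io.StringIO(
    "#separator \\x09\n"
    "#fields\tts\tuid\tid.orig_h\tid.orig_p\tid.resp_h\tid.resp_p\tproto\tduration\n"
    "1500000000.0\tCabc1\t10.0.0.5\t49152\t203.0.113.7\t443\ttcp\t1.5\n"
    "#close\t2017-07-14-00-00-00\n"
)
records = list(read_bro_log(sample))
```

All values are kept as strings here; a real pipeline would convert numeric columns and handle Bro's `-` placeholder for unset fields.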
Our machine learning algorithm uses 30 different features. These features are divided into features for flows, features for SSL handshakes and features for X.509 certificates. One of our main contributions is that our data model is based on connection 4-tuples. A connection 4-tuple aggregates the group of flows which share the same SrcIP, DstIP, DstPort, and protocol. Therefore, each connection summarizes the behavior of the malware while connecting to the same C&C server. Such aggregation proved paramount for the success of our method.
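The 4-tuple aggregation can be sketched as follows. This is a minimal example under stated assumptions: the flow records are dicts keyed by conn.log column names, and the two summary values (`n_flows`, `mean_duration`) are illustrative placeholders, not the 30 features used in our work.

```python
from collections import defaultdict
from statistics import mean

def aggregate_by_4tuple(flows):
    """Group flows that share SrcIP, DstIP, DstPort, and protocol."""
    groups = defaultdict(list)
    for flow in flows:
        key = (flow["id.orig_h"], flow["id.resp_h"], flow["id.resp_p"], flow["proto"])
        groups[key].append(flow)
    return groups

def summarize(groups):
    """Compute illustrative per-connection values (hypothetical, for demonstration)."""
    return {
        key: {
            "n_flows": len(flows),
            "mean_duration": mean([float(f["duration"]) for f in flows]),
        }
        for key, flows in groups.items()
    }

# Made-up flows: two go to the same server and port, one goes elsewhere.
flows = [
    {"id.orig_h": "10.0.0.5", "id.orig_p": "49152", "id.resp_h": "203.0.113.7",
     "id.resp_p": "443", "proto": "tcp", "duration": "1.0"},
    {"id.orig_h": "10.0.0.5", "id.orig_p": "49153", "id.resp_h": "203.0.113.7",
     "id.resp_p": "443", "proto": "tcp", "duration": "3.0"},
    {"id.orig_h": "10.0.0.5", "id.orig_p": "49154", "id.resp_h": "198.51.100.9",
     "id.resp_p": "443", "proto": "tcp", "duration": "0.5"},
]
connections = summarize(aggregate_by_4tuple(flows))
```

Because the source port is not part of the key, repeated flows from the same host to the same server and port fall into one connection even though each flow uses a different ephemeral source port.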
A core part of our research was the production and selection of correct datasets. We used 13 datasets from the CTU-13 malware dataset, 55 malware datasets from the Stratosphere Malware Capture Facility Project (done by Maria Jose Erquiaga), and 20 normal datasets that we produced ourselves. Each dataset was processed to extract the Bro files from the original pcap files. Afterwards, each dataset was labeled using our expert knowledge. The amount of malware and normal traffic in our entire dataset is balanced.
Our detection method consisted of using and comparing several machine learning algorithms to learn how normal HTTPS traffic differs from malware HTTPS traffic based on our behavioral features. Our results show that malware HTTPS behavior is distinct from normal HTTPS behavior and that our methods are able to detect malware with good accuracy without decrypting the traffic.