Practical Machine Learning in InfoSecurity

Practical Machine Learning in InfoSecurity

From BruCON 2017

Jump to: navigation, search

This lab session is designed to give attendees a quick introduction to ML concepts and gets up and running with the popular machine learning library, sci-kit learn.

We first start by building a basic understanding of how to integrate ML into an email spam identification system. We look at the inner workings and discuss the components involved in the system. Using the training data, we train our system to identify genuine messages and the system automatically learns from these examples. Different classifiers are tuned to get the maximum efficiency we can crunch out from this setup.

Once we have an efficient system, we do a deep dive and look at how one can trick the system to fail, again by using ML techniques.

Machine Learning (ML) is the future. Systems we use today use ML extensively, whether it is powering an e-commerce website or fraud detection in banking. However, it takes the average developer and security professional some level of skill and experience to apply machine learning and get useful results. It is a skill that anyone can learn, but we feel that material in this space is greatly lacking.

We give students a gentle introduction to the topic with the classic boolean classification problem and introduce classifiers, which are at the core of many of the most common ML systems. We deal with some easy to implement classifiers in sci-kit learn (linear classifiers, decision trees etc.), and show visualizations on how it works.

We then dive into training our classifiers with a labelled dataset. Trying different classifiers to approach the problem and verify the accuracy by cross verifying with the test data helps us choose an ideal algorithm for the problem in hand. This lab servers as a quick and practical introduction to the world of machine learning.

In addition, we guide the student through a simple example of deploying security machine learning systems in production pipelines in a distributed and scalable fashion using Apache Spark. Lastly, we will touch on ways that such systems can be poisoned, misguided, and utterly broken if the architects and implementers are not careful.