Python Plugins – Using scikit-learn for Outlier Detection

Machine learning is increasingly useful in data processing, and Apama’s new Python plug-in capability makes it easier to use from within EPL. Several machine learning libraries are available, such as TensorFlow and scikit-learn. We’ve chosen scikit-learn for this demo because an example of outlier detection using this library already exists (found here), and we’ll base the demo on it.

This demo will train several classifiers on a subset of the Boston Housing Dataset. It will then receive a series of events and check each one to see if it is considered an outlier by each classifier. The results will be output in the log.

The demo will be created within Software AG Designer. Steps for setting this up can be found here and in this video tutorial.

The full source for this demo can be found here.

Setup

In order to run this sample, several libraries are required. In Designer, open Window > Preferences > PyDev > Interpreters > Python Interpreters. Select the interpreter you set up and click the ‘Install/Uninstall with pip’ button.

Install scikit-learn by running the command: install scikit-learn

Install NumPy by running the command: install numpy
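If you want to confirm that the selected interpreter can see both packages, a quick check along these lines can help. This is only a sketch: check_packages is a helper name made up for this post, and it only tests whether the modules can be located, not whether they import cleanly. Note that scikit-learn is imported under the module name sklearn.

```python
import importlib.util


def check_packages(names=("sklearn", "numpy")):
    """Return {module_name: bool} saying whether each module can be located."""
    # find_spec returns None when a top-level module is not installed.
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in check_packages().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```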

The Sample

Begin by creating a monitor file to encapsulate the EPL logic. This file will load and initialize a plug-in and pass events to the plug-in for analysis.

At the top of the monitor file, create an event. This will represent the housing data that is sent into the system.

Then create your monitor. This monitor listens for the event we created above, and checks these events to see if they are outliers.

With the wrapper monitor file created, it’s time to create the plug-in. Begin by importing the relevant modules.

In your class initialization, create some classifiers to use for outlier detection, and load the training data.
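As a rough illustration of this step, the sketch below sets up three classifiers with settings similar to those in the scikit-learn outlier detection example. Several things here are assumptions rather than the demo's actual code: the class name OutlierDetector, the attribute names, the contamination value, and above all the training data, which is synthetic two-feature data standing in for the housing subset so that the snippet is self-contained. The Apama base class and action decorators that a real plug-in class needs are left out.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM


class OutlierDetector:
    """Plain-Python stand-in for the plug-in class (no Apama imports)."""

    def __init__(self):
        # One entry per classifier, keyed by the name reported back later.
        self.classifiers = {
            "Empirical Covariance": EllipticEnvelope(
                support_fraction=1.0, contamination=0.25, random_state=0),
            "Robust Covariance": EllipticEnvelope(
                contamination=0.25, random_state=0),
            "One-Class SVM": OneClassSVM(nu=0.25, gamma=0.05),
        }
        # Assumption: synthetic data replaces the housing subset.
        # Shape is (samples, features), which is what scikit-learn expects.
        rng = np.random.default_rng(0)
        self.training_data = rng.normal(
            loc=[6.0, 12.0], scale=[0.7, 5.0], size=(200, 2))


detector = OutlierDetector()
print(list(detector.classifiers))
print(detector.training_data.shape)
```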

Create a training function to train each classifier on the loaded data.
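A training function can be as small as a loop that calls fit on each classifier. The sketch below assumes the classifiers live in a dictionary keyed by name and again uses synthetic data in place of the housing subset; the function name train is illustrative only.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM


def train(classifiers, training_data):
    """Fit every classifier on the same training data and return the dict."""
    for clf in classifiers.values():
        # Outlier detectors are unsupervised, so fit takes no labels.
        clf.fit(training_data)
    return classifiers


# Assumption: synthetic two-feature data stands in for the housing subset.
rng = np.random.default_rng(0)
training_data = rng.normal(loc=[6.0, 12.0], scale=[0.7, 5.0], size=(200, 2))

trained = train(
    {
        "Robust Covariance": EllipticEnvelope(contamination=0.25, random_state=0),
        "One-Class SVM": OneClassSVM(nu=0.25, gamma=0.05),
    },
    training_data,
)
```

Training once at start-up and reusing the fitted classifiers for every incoming event keeps the per-event work down to a single predict call per classifier.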

Finally, create the function to check if a given piece of housing data is an outlier or not.

Begin by extracting the data from the EPL event and storing it in the format the classifiers expect: a two-dimensional array with one row per sample. Then, for each classifier, call predict, which returns an array with one value per sample indicating whether that sample is considered an outlier. Since we pass in only one sample, we can just read the first entry. The result is -1 for an outlier, or 1 for an inlier. Store the result as a boolean, keyed by the name of the classifier, in a dictionary. Once every classifier has been run, return the results to EPL.
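The check described above might look roughly like this. It is a sketch under several assumptions: the event is represented by a plain dictionary with two hypothetical fields, rooms and lstat, rather than an Apama event object; the training data is synthetic; and check_outlier is an illustrative name, not the action name used in the demo.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM


def check_outlier(classifiers, housing):
    """Return {classifier_name: True if the sample is flagged as an outlier}."""
    # predict expects a 2-D array: one row per sample, one column per feature.
    sample = np.array([[housing["rooms"], housing["lstat"]]])
    results = {}
    for name, clf in classifiers.items():
        predictions = clf.predict(sample)  # array of -1 (outlier) or 1 (inlier)
        results[name] = bool(predictions[0] == -1)
    return results


# Assumption: synthetic two-feature data stands in for the housing subset.
rng = np.random.default_rng(0)
training_data = rng.normal(loc=[6.0, 12.0], scale=[0.7, 5.0], size=(200, 2))
classifiers = {
    "Robust Covariance": EllipticEnvelope(contamination=0.25, random_state=0),
    "One-Class SVM": OneClassSVM(nu=0.25, gamma=0.05),
}
for clf in classifiers.values():
    clf.fit(training_data)

# A sample far away from all of the training data.
far_away = check_outlier(classifiers, {"rooms": 60.0, "lstat": 500.0})
print(far_away)
```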

Since each value in predictions is a numpy.int32 or numpy.int64, comparing it to -1 returns a numpy.bool_. EPL can’t implicitly cast this to a boolean, and will throw an error for a return value of the wrong type. To avoid this, cast the result of predictions[0] == -1 to a boolean before adding it to the results dictionary.
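The type difference is easy to see in isolation. This small sketch uses a hand-built array in place of a real predict result; it only needs NumPy.

```python
import numpy as np

# Stand-in for the array that predict returns for a single sample.
predictions = np.array([-1])

raw = predictions[0] == -1
print(isinstance(raw, np.bool_))  # True
print(isinstance(raw, bool))      # False: not a built-in Python bool

# Explicit cast before the value is handed back to EPL.
as_bool = bool(raw)
print(type(as_bool) is bool)      # True
```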

With this, your application should be ready to run. Create an event file to send in some events, some of which are outliers, to see the results.
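Event files contain one event per line, written as the event type name followed by its field values in parentheses. If you would rather generate the lines than type them, a helper like the one below could do it. The event name HousingData, its two float fields and the sample values are hypothetical; substitute the event type you defined in your monitor file, using its fully qualified name if it lives in a package.

```python
def format_event(event_type, *fields):
    """Render one event-file line, e.g. HousingData(6.2,11.0)."""
    # repr(float(...)) always includes a decimal point, which suits float fields.
    values = ",".join(repr(float(f)) for f in fields)
    return f"{event_type}({values})"


# Hypothetical samples: two ordinary-looking values and one extreme value.
samples = [(6.2, 11.0), (5.8, 14.5), (60.0, 500.0)]
lines = [format_event("HousingData", rooms, lstat) for rooms, lstat in samples]

print("\n".join(lines))
```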

You should see some output like below.

And with that, we’ve performed outlier detection in EPL using a Python plug-in. This demonstrates some of the powerful possibilities of being able to use Python from an Apama correlator.

– Antony