25 April 2017

Machine Learning and Parallel Processing in Julia, Part I

Julia is a high-level, high-performance dynamic programming language intended primarily for numerical computing.  It is a young language: development began in 2009, and it was first publicly revealed on Valentine’s Day 2012.  Julia has the not-so-modest objectives of being, among many other things, as usable for general programming as Python, as easy for statistics as R, as powerful for linear algebra as Matlab (or its open-source clone Octave), simple to learn, powerful enough to keep serious hackers happy, as fast as compiled statically-typed languages like Fortran and C, and simple to use for distributed/parallel computing.

As Julia’s creators asked in a blog post announcing the language: “All this doesn’t seem like too much to ask for, does it?”

Well, it does all seem a bit too good to be true.  I mean, have these folks never heard that you can’t have your cake and eat it?  There being only one sure way to find out, I have been playing with Julia for the past few weeks.  In fact, the k-means clustering in my recent Patentology blog post on locations of Australian patent applicants was coded in Julia.  Encouraged by this experience, I have decided for the time being to make Julia, rather than Python, R or Octave, my main language for technical computing.  I find that the best way to learn a programming language is to dive right in and start using it for real projects.

So, as a way into Julia, and into using it to run machine learning (ML) algorithms, I set myself the task of porting Zac Stewart’s simple spam classifier from Python to Julia.  This task is somewhat simplified by the fact that the widely-used Python ML library, scikit-learn, has been integrated into a Julia package, ScikitLearn.  Currently, parts of the package are implemented in Julia, while other parts call the Python scikit-learn library via the Julia PyCall package.  Aside from how the ML models and functions are imported into the Julia environment, however, the ScikitLearn package provides a consistent interface, so that models from both the Julia ecosystem and the Python library can be used in the same way within Julia code.

I do not intend to reproduce Zac Stewart’s spam classification tutorial here, only to present an implementation in Julia.  In the second article in this short series, I will show how this implementation can be easily modified to take advantage of Julia’s parallel processing capabilities.  If you want to understand how the model itself works, or to compare my code with the Python original, you will need to open up the original post: Document Classification with scikit-learn.  Additionally, if you want to have a go at implementing the model yourself, you will need to download and unpack the corpus of about 55,000 spam and non-spam emails from which the model learns how to classify documents.  These samples are available from the following links:
  1. the Enron-Spam data sets (in raw form); and
  2. the SpamAssassin public corpus.
With the preliminaries out of the way, let’s look at some Julia code!