1 May 2017

Machine Learning and Parallel Processing in Julia, Part II

In the previous article, I presented an implementation of Zac Stewart’s simple spam classifier – originally written in Python – in the Julia programming language.  The classifier applies a multinomial naïve Bayes algorithm to learn how to distinguish spam messages from non-spam (a.k.a. ‘ham’) messages.  A sample data set of 55326 messages, each already labelled as ‘spam’ or ‘not spam’, is used for training and validation of the classifier.  In particular, a technique is employed in which the sample set is divided into a ‘training’ part and a ‘cross-validation’ part.  The classifier is first configured, i.e. ‘trained’, based upon the training subset.  Its performance is then tested by using it to classify each example in the cross-validation subset, and comparing its output (‘spam’ or ‘not spam’) against the known ‘truth’ value with which the example has previously been labelled.

Cross-validation is generally used for evaluating and optimising machine learning (ML) systems.  There are many different algorithms available for ML, and each algorithm may have a number of variations and parameters that can be tuned to maximise accuracy.  It can therefore be useful to trial a number of different algorithms and/or parameters to find the most effective model.  By reserving a part of the sample data set for testing, the performance of a trained model can be evaluated.  And to ensure that the results are not skewed by a particular choice of training and testing subsets, a technique called k-fold cross-validation can be used, in which the original sample data set is partitioned into k equal-sized subsamples.  Training and testing are then repeated k times (the ‘folds’), with a different single subsample being retained each time as the validation data for testing the model, and the remaining k−1 subsamples being used as training data.  The k results are averaged to produce a final performance evaluation.
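To make the k-fold procedure concrete, here is a minimal sketch in Julia.  It is not the code from the classifier itself: the names kfold_indices, cross_validate and evaluate are hypothetical, the folds are simple contiguous index ranges (so the samples are assumed to have been shuffled beforehand), and the ‘score’ in the toy usage is a stand-in for a real train-and-test step.

```julia
# Hypothetical helper: split sample indices 1:n into k contiguous folds
# whose sizes differ by at most one.
function kfold_indices(n::Integer, k::Integer)
    1 <= k <= n || throw(ArgumentError("need 1 <= k <= n"))
    bounds = [div(i * n, k) for i in 0:k]   # fold i covers bounds[i]+1 : bounds[i+1]
    return [(bounds[i] + 1):bounds[i + 1] for i in 1:k]
end

# Run k-fold cross-validation.  `evaluate(train, test)` is a placeholder that
# should train a model on the `train` indices and return a score for `test`.
function cross_validate(evaluate, n::Integer, k::Integer)
    folds = kfold_indices(n, k)
    scores = map(folds) do test
        train = setdiff(1:n, test)          # the remaining k-1 subsamples
        evaluate(train, test)
    end
    return sum(scores) / k                  # average of the k results
end

# Toy usage: the "score" is just the fraction of samples held out for testing.
folds = kfold_indices(55326, 6)
mean_score = cross_validate((train, test) -> length(test) / 55326, 55326, 6)
```

In a real run, evaluate would fit the classifier on the training indices and return an accuracy or F1 score for the held-out fold.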

Model training and testing are generally computationally expensive compared to subsequent use of the optimised and trained model.  A 6-fold cross-validation of the naïve Bayes algorithm using the full 55326 message data set took just over 180 seconds to run on my computer, i.e. about 30 seconds per fold, executing sequentially.  Large-scale production systems address the time problem by using farms of servers operating in parallel, each with multiple CPUs and graphics processing units (GPUs), the latter turning out to be very good at the kinds of computations required for ML.

But even those of us toying around with ML on ordinary desktop and notebook PCs can parallelise our processing a little, assuming that our machine has a relatively modern CPU with multiple cores, and that we can find some way to take advantage of parallelism in our code. 

K-fold cross-validation is an easy target for parallelisation, since each fold can be evaluated independently of the others.  And Julia has features built-in that are designed to simplify writing code that can execute in parallel, running in multiple processes on either a single machine/CPU, or on multiple networked machines.  So let’s see just how easy it is to use these features.
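As a rough illustration of the idea, and not the implementation developed in this article, the sketch below farms the folds out to worker processes using addprocs, @everywhere and pmap.  It assumes Julia 1.x, where these live in the Distributed standard library (in the releases current in 2017 they were available directly from Base).  The function names evaluate_fold and parallel_cross_validate are hypothetical, the choice of two workers is arbitrary, and the ‘score’ is a placeholder for real training and testing.

```julia
using Distributed                 # standard library in Julia 1.x

# Start two extra worker processes if none are running yet.  The number is
# arbitrary; a sensible value depends on how many CPU cores are available.
nprocs() == 1 && addprocs(2)

# @everywhere defines the function on every process, so workers can call it.
@everywhere function evaluate_fold(fold::Int, k::Int, n::Int)
    lo = div((fold - 1) * n, k) + 1
    hi = div(fold * n, k)
    test = lo:hi                  # indices held out for this fold
    # Placeholder: a real version would train on the other indices and
    # return a performance score for `test`.
    return length(test) / n
end

# Each fold is independent, so pmap can hand the folds to whichever
# worker is free and collect the results in order.
function parallel_cross_validate(n::Int, k::Int)
    scores = pmap(fold -> evaluate_fold(fold, k, n), 1:k)
    return sum(scores) / k        # average of the k results
end

mean_score = parallel_cross_validate(55326, 6)
```

Because the folds share no state, the only coordination needed is collecting the k scores at the end, which is what makes this such a convenient case for pmap.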

25 April 2017

Machine Learning and Parallel Processing in Julia, Part I

Julia is a high-level, high-performance dynamic programming language intended primarily for numerical computing.  It is a young language – development began in 2009, and it was first publicly revealed on Valentine’s Day 2012.  Julia has the not-so-modest objectives of being, among many other things, as usable for general programming as Python, as easy for statistics as R, as powerful for linear algebra as Matlab (or its open-source clone Octave), simple to learn, powerful enough to keep serious hackers happy, as fast as compiled statically-typed languages like Fortran and C, and simple to use for distributed/parallel computing.

As Julia’s creators asked in a blog post announcing the language: “All this doesn’t seem like too much to ask for, does it?”

Well, it does all seem a bit too good to be true.  I mean, have these folks never heard that you can’t have your cake and eat it?  There being only one sure way to find out, I have been playing with Julia for the past few weeks.  In fact, the k-means clustering in my recent Patentology blog post on locations of Australian patent applicants was coded in Julia.  Encouraged by this experience, I have decided for the time being to make Julia, rather than Python, R or Octave, my main language for technical computing.  I find that the best way to learn a programming language is to dive right in and start using it for real projects.

So, as a way in to Julia, and to using it to run machine learning (ML) algorithms, I set myself the task of porting Zac Stewart’s simple spam classifier from Python to Julia.  This task is somewhat simplified by the fact that the widely-used Python ML library, scikit-learn, has been integrated into a Julia package.  Currently, parts of the package are implemented in Julia, while other parts actually call on the Python scikit-learn library via the Julia PyCall package.  However, aside from how the ML models and functions are imported into the Julia environment, the ScikitLearn package provides a consistent interface allowing models from both the Julia ecosystem and the Python library to be accessed seamlessly within Julia code.

I do not intend to reproduce Zac Stewart’s spam classification tutorial here, only to present an implementation in Julia.  In the second article in this short series, I will show how this implementation can be easily modified to take advantage of Julia’s parallel processing capabilities.  If you want to understand how the model itself works, as well as being able to compare my code with the Python original, you will need to open up the original post: Document Classification with scikit-learn.  Additionally, if you want to have a go at implementing the model yourself, you will need to download and unpack the body of about 55,000 samples of spam and non-spam emails that it uses to learn how to classify documents.  These samples are available from the following links:
  1. the Enron-Spam (in raw form) data sets; and
  2. the SpamAssassin public corpus.
With the preliminaries out of the way, let’s look at some Julia code!

22 December 2016

My ‘Obsolete’ Tech

The other day I read an article by Megan McArdle at Bloomberg View, entitled Tech Upgrades Just Aren’t That Great Anymore.  In the article, she explains how she came to replace her four-year-old Macbook Pro with a nearly identical Macbook that is – shock, horror – not quite the newest model available!  How could a self-confessed hardcore power user (and gamer) possibly have reached this point?

However, the article made me feel much better about my own tech choices.  As I read, it dawned on me that I had unconsciously arrived at the same conclusion as McArdle about two years ago.  The fact is that even users who demand a lot from their tech just no longer need the latest hardware.  For a number of years now it has been possible to buy a laptop, a smartphone or a tablet with all the processing speed and memory required to do everything you need from them for years to come. 

As McArdle rightly points out, the upgrade cycle is no longer delivering improvements in processor speed and memory.  Instead, what we get is improvement in physical, rather than technical, specifications.  The new generation of devices is thinner and lighter than ever before, but processors are, at most, only marginally faster.  Indeed, in some respects tech specs are going backwards.  McArdle notes the loss of USB and SD ports, memory expansion options, and keyboard quality, all sacrificed in the race to deliver the thinnest devices ever seen.

The problem with trying to improve technical specs is power.  And the problem with power in portable devices is twofold – battery capacity and heat.  I cannot help wondering whether the inevitable consequence of trying to push all of the boundaries – performance, size/weight and battery capacity – in a single device is the debacle that was the Samsung Galaxy Note 7.

I am pleased to say, however, that I do not own any device that is likely to catch fire.  Given that everything I own and use daily is at least two years old, if any of it was going to spontaneously combust then it would probably have done so by now!  So in a world of ever-more-incremental upgrades, just how obsolescent is my tech?