In the previous article, I presented an implementation of Zac Stewart’s simple spam classifier – originally written in Python – in the Julia programming language. The classifier applies a multinomial naïve Bayes algorithm to learn how to distinguish spam messages from non-spam (a.k.a. ‘ham’) messages. A sample data set of 55326 messages, each already labelled as ‘spam’ or ‘not spam’, is used for training and validation of the classifier. In particular, a technique is employed in which the sample set is divided into a ‘training’ part and a ‘cross-validation’ part. The classifier is first configured, i.e. ‘trained’, based upon the training subset. Its performance is then tested by using it to classify each example in the cross-validation subset, and comparing its output (‘spam’ or ‘not spam’) against the known ‘truth’ value with which the example has previously been labelled.
Cross-validation is generally used for evaluating and optimising machine learning (ML) systems. There are many different algorithms available for ML, and each algorithm may have a number of variations and parameters that can be tuned to maximise accuracy. It can therefore be useful to trial a number of different algorithms and/or parameters to find the most effective model. By reserving a part of the sample data set for testing, the performance of a trained model can be evaluated. And to ensure that the results are not skewed by a particular choice of training and testing subsets, a technique called k-fold cross-validation can be used, in which the original sample data set is partitioned into k equal-sized subsamples. Cross-validation is then repeated k times (the ‘folds’), with a single subsample being retained each time as the validation data for testing the model, and the remaining k−1 subsamples being used as training data. The k results are averaged to produce a final performance evaluation.
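In code, the procedure is only a few lines. The following is a minimal sketch of k-fold cross-validation in Julia; `train` and `accuracy` are hypothetical stand-ins for whatever training and scoring functions the classifier actually provides, so read it as an illustration of the partitioning scheme rather than a drop-in implementation.

```julia
using Random, Statistics

# Sketch of k-fold cross-validation. `train` and `accuracy` are hypothetical
# placeholders for the classifier's training and scoring steps.
function kfold_cross_validate(data, k)
    n = length(data)
    idx = shuffle(1:n)                                 # randomise before partitioning
    folds = [idx[i:k:n] for i in 1:k]                  # k roughly equal-sized subsamples
    scores = Float64[]
    for i in 1:k
        test_idx  = folds[i]                           # hold out one subsample for validation
        train_idx = reduce(vcat, folds[setdiff(1:k, i)])  # train on the remaining k−1
        model = train(data[train_idx])                 # hypothetical training step
        push!(scores, accuracy(model, data[test_idx])) # hypothetical evaluation step
    end
    return mean(scores)                                # average the k results
end
```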
Model training and testing are generally computationally expensive compared to subsequent use of the optimised and trained model. A 6-fold cross-validation of the naïve Bayes algorithm using the full 55326 message data set took just over 180 seconds to run on my computer, i.e. 30 seconds per fold, executing sequentially. Large-scale production systems address the time problem by using farms of servers operating in parallel, equipped with multiple CPUs and graphics processing units (GPUs), which turn out to be very good at the kinds of computations required for ML.
But even those of us toying around with ML on ordinary desktop and notebook PCs can parallelise our processing a little, assuming that our machine has a relatively modern CPU with multiple cores, and that we can find some way to take advantage of parallelism in our code.
K-fold cross-validation is an easy target for parallelisation, since each fold can be evaluated independently of the others. And Julia has built-in features designed to simplify writing code that executes in parallel, running as multiple processes on either a single machine/CPU or on multiple networked machines. So let’s see just how easy it is to use these features.
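To give a flavour of what this looks like, here is a minimal sketch using the `Distributed` standard library (in recent Julia versions; in early releases the same functions lived in Base). The `evaluate_fold` function is a hypothetical stub standing in for training and testing the classifier on one train/validation split; the point is simply that `pmap` farms the independent fold evaluations out to worker processes.

```julia
using Distributed

addprocs(4)                             # start worker processes, e.g. one per CPU core

@everywhere begin
    # Hypothetical per-fold evaluation: train on the other folds, test on fold i,
    # and return an accuracy score. Here it is a stub that just mimics the work.
    function evaluate_fold(i)
        sleep(1)                        # stand-in for ~30 s of training and testing
        return 0.0                      # stand-in for the fold's accuracy
    end
end

scores = pmap(evaluate_fold, 1:6)       # the 6 folds run concurrently on the workers
overall = sum(scores) / length(scores)  # average the per-fold results
```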