1 May 2017

Machine Learning and Parallel Processing in Julia, Part II

In the previous article, I presented an implementation of Zac Stewart’s simple spam classifier – originally written in Python – in the Julia programming language.  The classifier applies a multinomial naïve Bayes algorithm to learn how to distinguish spam messages from non-spam (a.k.a. ‘ham’) messages.  A sample data set of 55326 messages, each already labelled as ‘spam’ or ‘not spam’, is used for training and validation of the classifier.  In particular, a technique is employed in which the sample set is divided into a ‘training’ part and a ‘cross-validation’ part.  The classifier is first configured, i.e. ‘trained’, based upon the training subset.  Its performance is then tested by using it to classify each example in the cross-validation subset, and comparing its output (‘spam’ or ‘not spam’) against the known ‘truth’ value with which the example has previously been labelled.

Cross-validation is generally used for evaluating and optimising machine learning (ML) systems.  There are many different algorithms available for ML, and each algorithm may have a number of variations and parameters that can be tuned to maximise accuracy.  It can therefore be useful to trial a number of different algorithms and/or parameters to find the most effective model.  By reserving a part of the sample data set for testing, the performance of a trained model can be evaluated.  To ensure that the results are not skewed by a particular choice of training and testing subsets, a technique called k-fold cross-validation can be used, in which the original sample data set is partitioned into k equal-sized subsamples.  Cross-validation is then repeated k times (the ‘folds’), with a single subsample being retained each time as the validation data for testing the model, and the remaining k−1 subsamples being used as training data.  The k results are averaged to produce a final performance evaluation.
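To make the partitioning concrete, here is a minimal sketch of k-fold index splitting in plain Julia.  It is an illustration only, not how ScikitLearn.jl implements KFold: the function names kfold_indices and cross_validate are hypothetical, the folds are assumed to be contiguous and unshuffled, and the code is written for Julia 1.x rather than the 0.5 release used elsewhere in this article.

```julia
# Split the indices 1:n into k contiguous folds of near-equal size.
# Returns a vector of (training indices, validation indices) pairs.
function kfold_indices(n::Int, k::Int)
    1 < k <= n || throw(ArgumentError("need 1 < k <= n"))
    base, extra = divrem(n, k)   # the first `extra` folds get one more item
    folds = Vector{Tuple{Vector{Int},Vector{Int}}}()
    start = 1
    for f in 1:k
        len = base + (f <= extra ? 1 : 0)
        val = collect(start:start + len - 1)
        train = vcat(collect(1:start - 1), collect(start + len:n))
        push!(folds, (train, val))
        start += len
    end
    return folds
end

# Average a per-fold score over all k folds.  `score` is a stand-in for
# "train on `train`, then evaluate on `val`".
function cross_validate(score::Function, n::Int, k::Int)
    scores = [score(train, val) for (train, val) in kfold_indices(n, k)]
    return sum(scores) / length(scores)
end

folds = kfold_indices(55326, 6)
println(length(folds[1][2]))   # validation messages in the first fold
println(length(folds[1][1]))   # training messages in the first fold
```

With 55326 messages and six folds, every validation subset holds 9221 messages and every training subset holds the other 46105.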

Model training and testing is generally computationally expensive compared to subsequent use of the optimised and trained model.  A 6-fold cross-validation of the naïve Bayes algorithm using the full 55326 message data set took just over 180 seconds to run on my computer, i.e. 30 seconds per fold, executing sequentially.  Large-scale production systems address the time problem by using farms of servers with multiple CPUs and graphics processing units (GPUs) operating in parallel; GPUs turn out to be very good at the kinds of computations required for ML.

But even those of us toying around with ML on ordinary desktop and notebook PCs can parallelise our processing a little, assuming that our machine has a relatively modern CPU with multiple cores, and that we can find some way to take advantage of parallelism in our code. 

K-fold cross-validation is an easy target for parallelisation, since each fold can be evaluated independently of the others.  And Julia has built-in features that are designed to simplify writing code that can execute in parallel, running in multiple processes on either a single machine/CPU, or on multiple networked machines.  So let’s see just how easy it is to use these features.

V. Parallelising the Code

I need to modify my code a little so that it can be parallelised.  When Julia is managing multiple processes, one is a ‘master’ process, while the others are ‘worker’ processes.  The default behaviour is that using statements import modules into all processes, but other than that everything only executes in the master process unless specifically directed otherwise.  To direct that some part of the code is executed in all processes, the @everywhere macro may be used.
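As a small illustration of @everywhere, the sketch below defines two constants and a helper in every process, then asks each process to call the helper.  It makes some assumptions that differ from the article's setup: it targets Julia 1.x, where the parallel primitives live in the Distributed standard library and must be loaded explicitly (in 0.5 they are available without an import), and the helper label_for is a hypothetical name invented for the example.

```julia
using Distributed   # required in Julia 1.x; not needed in Julia 0.5

# Optionally start local workers first, e.g. addprocs(2), which is comparable
# to launching Julia with `-p 2`.  Without workers, everything below simply
# runs in the master process (id 1).

# Without @everywhere, these definitions would exist only in process 1.
@everywhere HAM = "NOT SPAM"
@everywhere SPAM = "SPAM"
@everywhere label_for(is_spam::Bool) = is_spam ? SPAM : HAM   # hypothetical helper

# Ask every process (master included) to evaluate the helper and return the result.
results = Dict(pid => remotecall_fetch(label_for, pid, true) for pid in procs())

for (pid, label) in results
    println("process ", pid, " says ", label)
end
```

If the @everywhere prefixes were dropped and workers were running, the remote calls would be expected to fail, because label_for would be unknown in the worker processes.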

In the following code, I direct the Julia runtime to execute the @sk_import macros in every process, because the code they bring in is required by the cross-validation processing (i.e. training and testing steps) that I want Julia to farm out to the worker processes.  The workers will also need access to the definitions of the classification labels HAM and SPAM.  I also need to move the core processing out of the exCrossValidate() function and into a separate function that is itself defined within the context of each worker process.

using DataFrames
using StringEncodings
using ScikitLearn
using ScikitLearn.Pipelines: Pipeline
@everywhere begin
    @ScikitLearn.sk_import feature_extraction.text: CountVectorizer
    @ScikitLearn.sk_import naive_bayes: MultinomialNB
    @ScikitLearn.sk_import metrics: (confusion_matrix, f1_score, classification_report)
end

# ...

@everywhere HAM = "NOT SPAM"
@everywhere SPAM = "SPAM"

# ...

@everywhere function crossval(i::Int64, classNames::Array{String,1}, 
                                trainText::DataArrays.DataArray, trainClass::DataArrays.DataArray, 
                                cvText::DataArrays.DataArray, cvClass::DataArrays.DataArray)
    # Force garbage collection before beginning
    gc()
    @printf("    Fold # %d...\n", i)
    pipeline = ScikitLearn.Skcore.Pipeline([
        ("vectorizer",  CountVectorizer()),
        ("classifier",  MultinomialNB()) ])
    ScikitLearn.fit!(pipeline, trainText, trainClass)
    cvPred = ScikitLearn.predict(pipeline, cvText)
    confusion = confusion_matrix(cvClass, cvPred, labels = classNames)
    score = f1_score(cvClass, cvPred, pos_label = SPAM)
    println(classification_report(cvClass, cvPred, labels = classNames))
    return (score, confusion)
end

Note that although the using statements import modules, they do not modify the namespace of the worker processes in the same way as they do in the master process (i.e. Julia’s ‘main’ module).  I have therefore used fully-qualified names (e.g. DataArrays.DataArray and ScikitLearn.fit!) in code that is designed to run in worker processes.

VI. Running Cross-Validation in Parallel

To run the crossval() function in a worker process, the @spawn macro may be used.  This assigns the task to a worker process and immediately returns a Future, i.e. a reference to the task’s eventual result, that can subsequently be passed to a fetch() function call to retrieve that result.  Note that the Julia runtime manages all of the required queuing, allocation, passing of data, and waiting on completion of each worker.  All I have to do is spawn all of my tasks in one loop, and then gather up all of the results in a following loop!
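Stripped of the classifier details, the spawn-then-gather pattern can be sketched as follows.  This is a toy under stated assumptions: it is written for Julia 1.x, where `@spawnat :any` is the current spelling of the `@spawn` macro used in this article, and slow_square is a hypothetical stand-in for the work of one fold.

```julia
using Distributed   # required in Julia 1.x; not needed in Julia 0.5

# Hypothetical stand-in for training and testing one fold.
@everywhere function slow_square(i::Int)
    sleep(0.1)
    return i^2
end

# First loop: hand out every task without waiting for any of them.
# Each call returns a Future straight away.
futures = [@spawnat :any slow_square(i) for i in 1:6]

# Second loop: collect the results in order; fetch blocks until each is ready.
results = [fetch(f) for f in futures]
println(results)
```

The exCrossValidate() function applies the same two-loop structure to the real folds.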

function exCrossValidate(ds::DataFrame, folds::Int64)
    kFold = ScikitLearn.CrossValidation.KFold(size(ds, 1), n_folds = folds)
    proc = Vector(folds)
    # This creates a DataFrame with each row containing a count of the number of
    # records with each unique label, then extracts the labels into an array of strings
    classNames = convert(Array, by(ds, [:class], nrow)[:class])
    classes = length(classNames)
    println("Evaluating model...")
    # Spawn parallel processes
    for (i, (trainIdx, cvIdx)) in enumerate(kFold)
        proc[i] = @spawn crossval(i, classNames, ds[:text][trainIdx], ds[:class][trainIdx], ds[:text][cvIdx], ds[:class][cvIdx])
    end
    # Now gather the results
    scores = []
    totalConfusion = zeros(classes, classes)
    for i in 1:folds
        result = fetch(proc[i])
        push!(scores, result[1])
        totalConfusion += result[2]
    end
    
    # Make a dataframe from the confusion matrix
    colNames = map(x -> convert(Symbol, x), classNames)
    confusionDf = convert(DataFrame, totalConfusion)
    names!(confusionDf, colNames)
    confusionDf[:truth] = classNames

    @printf("Total emails classified: %d\n", size(ds, 1))
    @printf("Average score: %.6f\n", mean(scores))
    @printf("Confusion matrix:\n")
    println(confusionDf)
end

To run this in multiple parallel processes, I first start Julia using the -p flag.  I then run my code as before.  In the example below, two worker processes are created, and six cross-validation folds.  We therefore expect that each worker is going to end up running three cross-validations.

> julia -p 2
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.5.2-pre+1 (2017-03-06 03:59 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit a6c55c5 (48 days old release-0.5)
|__/                   |  x86_64-linux-gnu

julia> include("p-spamology.jl")
Loading data set...
Done.

julia> exCrossValidate(ds, 6)
Evaluating model...
        From worker 2:      Fold # 1...
        From worker 3:      Fold # 2...
        From worker 2:               precision    recall  f1-score   support
        From worker 2:
        From worker 2:     NOT SPAM       0.87      0.99      0.93      3668
        From worker 2:         SPAM       0.99      0.90      0.95      5553
        From worker 2:
        From worker 2:  avg / total       0.94      0.94      0.94      9221
        From worker 2:
        From worker 2:      Fold # 3...
        From worker 3:               precision    recall  f1-score   support
        From worker 3:
        From worker 3:     NOT SPAM       0.86      0.99      0.92      3620
        From worker 3:         SPAM       1.00      0.90      0.94      5601
        From worker 3:
        From worker 3:  avg / total       0.94      0.93      0.94      9221
        From worker 3:
        From worker 3:      Fold # 4...
        From worker 2:               precision    recall  f1-score   support
        From worker 2:
        From worker 2:     NOT SPAM       0.87      0.99      0.93      3723
        From worker 2:         SPAM       0.99      0.90      0.95      5498
        From worker 2:
        From worker 2:  avg / total       0.95      0.94      0.94      9221
        From worker 2:
        From worker 2:      Fold # 5...
        From worker 3:               precision    recall  f1-score   support
        From worker 3:
        From worker 3:     NOT SPAM       0.86      0.99      0.92      3591
        From worker 3:         SPAM       0.99      0.90      0.94      5630
        From worker 3:
        From worker 3:  avg / total       0.94      0.93      0.93      9221
        From worker 3:
        From worker 3:      Fold # 6...
        From worker 2:               precision    recall  f1-score   support
        From worker 2:
        From worker 2:     NOT SPAM       0.86      0.99      0.92      3551
        From worker 2:         SPAM       1.00      0.90      0.94      5670
        From worker 2:
        From worker 2:  avg / total       0.94      0.93      0.94      9221
        From worker 2:
        From worker 3:               precision    recall  f1-score   support
        From worker 3:
        From worker 3:     NOT SPAM       0.87      0.99      0.92      3685
        From worker 3:         SPAM       0.99      0.90      0.94      5536
        From worker 3:
        From worker 3:  avg / total       0.94      0.93      0.94      9221
        From worker 3:
Total emails classified: 55326
Average score: 0.944388
Confusion matrix:
2×3 DataFrames.DataFrame
│ Row │ NOT SPAM │ SPAM    │ truth      │
├─────┼──────────┼─────────┼────────────┤
│ 1   │ 21655.0  │ 183.0   │ "NOT SPAM" │
│ 2   │ 3365.0   │ 30123.0 │ "SPAM"     │

julia> 

As you can see, again we get the same expected final results.  It is the behaviour of the Julia runtime that is of interest here, however.  The two worker processes are identified as ‘2’ and ‘3’ (the master process is always ‘1’).  Julia handles the output generated by each worker, identifying its source, so we can see that worker 2 has run folds 1, 3 and 5, while worker 3 has run folds 2, 4 and 6.

As for timing, this code took 107 seconds to run, across two processor cores.  That would break down to the same 30 seconds per fold, with three folds per core making up 90 seconds, plus 17 seconds of overhead, presumably taken up by Julia’s management of the worker processes and transfer of data in and out of each process.
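The same figures can be turned into a speedup and an efficiency estimate.  The short calculation below is only a back-of-envelope sketch: it treats the sequential run as exactly 180 seconds (it was ‘just over’ that), and the variable names are my own.

```julia
sequential_time = 180.0   # seconds for 6 folds on one core
parallel_time   = 107.0   # seconds for 6 folds across two workers
workers_used    = 2
folds           = 6

per_fold   = sequential_time / folds             # time for a single fold
ideal      = per_fold * folds / workers_used     # perfect two-way split
overhead   = parallel_time - ideal               # time not explained by the folds
speedup    = sequential_time / parallel_time
efficiency = speedup / workers_used

println("per fold:   ", per_fold, " s")
println("ideal:      ", ideal, " s")
println("overhead:   ", overhead, " s")
println("speedup:    ", round(speedup, digits = 2))
println("efficiency: ", round(efficiency, digits = 2))
```

On these numbers the two-worker run is about 1.68 times faster than the sequential one, i.e. roughly 84% of the ideal two-fold speedup.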

VII. Conclusion – Should You Switch to Julia?

So there we have it – some basic machine learning and parallel processing in Julia!

If you are a scientist, engineer or statistician currently using Python, R, Matlab or Octave for your technical computing needs, you might be looking at all this and wondering whether you should switch to Julia.  After all, if it delivers on all of its promises (and it certainly seems to be well on the way) you might never again have to choose between the pros and cons of other languages.

There are, however, some drawbacks to Julia in its current state.  As I said at the outset of the first part of this series, Julia is a young language.  As you might have noticed in the above output transcripts, the latest and greatest stable version right now is 0.5.2 – yes, that’s a zero at the front of that version number.  The language itself is still evolving, so you are going to need to keep your Julia installation updated, and there is no guarantee that code you write today will still run in a year or two.  Error messages can be obscure, to say the least, and it is sometimes not at all clear exactly which line of code caused an error.  The user base is relatively small, so there is not as much online community support and assistance as you would find with more mature languages.  In terms of development tools, there is a kind-of IDE called Juno, based on the Atom editor – it is actually pretty good if you want something along the lines of the Matlab or Octave desktops, but it is hardly what any serious programmer would regard as a full-featured development and debugging environment.  And if you are going to be collaborating with other workers, then you will probably have difficulty in getting everyone to adopt a new language and way of working.

Having said that, it seems entirely possible that Julia may one day be the gold standard for technical computing to which it aspires, so the sooner you jump on the bandwagon the better prepared you will be if/when that happens.
