25 April 2017

Machine Learning and Parallel Processing in Julia, Part I

Julia logoJulia is a high-level, high-performance dynamic programming language intended primarily for numerical computing.  It is a young language – development began in 2009, and was first publicly-revealed on Valentine’s day 2012.  Julia has the not-so-modest objectives of being, among many other things, as usable for general programming as Python, as easy for statistics as R, as powerful for linear algebra as Matlab (or its open-source clone Octave), simple to learn, powerful enough to keep serious hackers happy, as fast as compiled statically-typed languages like Fortran and C, and simple to use for distributed/parallel computing.

As Julia’s creators asked in a blog post announcing the language: “All this doesn’t seem like too much to ask for, does it?”

Well, it does all seem a bit too good to be true.  I mean, have these folks never heard that you can’t have your cake and eat it?  There being only one sure way to find out, I have been playing with Julia for the past few weeks.  In fact, the k-means clustering in my recent Patentology blog post on locations of Australian patent applicants was coded in Julia.  Encouraged by this experience, I have decided for the time being to make Julia, rather than Python, R or Octave, my main language for technical computing.  I find that the best way to learn a programming language is to dive right in and start using it for real projects.

So, as a way in to Julia, and to using it to run machine learning (ML) algorithms, I set myself the task of porting Zac Stewart’s simple spam classifier from Python to Julia.  This task is somewhat simplified by the fact that the widely-used Python ML library, scikit-learn, has been integrated into a Julia package.  Currently, parts of the package are implemented in Julia, while other parts actually call on the Python scikit-learn library via the Julia PyCall package.  However, aside from how the ML models and functions are imported into the Julia environment, the ScikitLearn package provides a consistent interface allowing models from both the Julia ecosystem and the Python library to be accessed seamlessly within Julia code.

I do not intend to reproduce Zac Stewart’s spam classification tutorial here, only to present an implementation in Julia.  In the second article in this short series, I will show how this implementation can be easily modified to take advantage of Julia’s parallel processing capabilities.  If you want to understand how the model itself works, as well as being able to compare my code with the Python original, you will need to open up the original post: Document Classification with scikit-learn.  Additionally, if you want to have a go at implementing the model yourself, you will need to download and unpack the body of about 55,000 samples of spam and non-spam emails that it uses to learn how to classify documents.  These samples are available from the following links:
  1. the Enron-Spam (in raw form) data sets; and
  2. the SpamAssassin public corpus.
With the preliminaries out of the way, let’s look at some Julia code!

I. Setting Up

The first thing we need to do is to import the various Julia modules that will be used by the program, and define some useful parameters.

#   spamology.jl
#   Based on Zac Stewart's Python spam classifier tutorial code, now in Julia!
#   See: http://zacstewart.com/2015/04/28/document-classification-with-scikit-learn.html
#   Mark Summerfield, April 2017

using DataFrames
using StringEncodings

using ScikitLearn
using ScikitLearn.Pipelines: Pipeline
@sk_import feature_extraction.text: CountVectorizer
@sk_import naive_bayes: MultinomialNB
@sk_import metrics: (confusion_matrix, f1_score, classification_report)

NEWLINE = "\n"
SKIPFILES = ["cmds"]
SPAMROOT = "path/to/spamdata/dir"
SPAMDATA = "spamdat.csv"

DataFrames are useful for managing tabular data (and provide similar functionality to the data structures of the same name in the Python pandas library).  StringEncodings does pretty much what it says, and I have already introduced ScikitLearn.

The ‘@sk_import’ statements are macros from the ScikitLearn package that import interfaces to components of the Python scikit-learn library.  These are required for each Python model or function that will be called upon within the Julia program.  SPAMROOT specifies the directory containing the sample emails, while SPAMDATA is the name of a comma-separated values (CSV) file that will be used to cache the content extracted from the emails and associated labels (i.e. ‘spam’ or ‘not spam’).

Next, we define the two labels, and a list mapping the directories within our downloaded data set to the type of content.  Each directory contains either ‘spam’ or ‘not spam’ (i.e. ‘ham’) emails.


    ("spam",        SPAM),
    ("easy_ham",    HAM),
    ("hard_ham",    HAM),
    ("beck-s",      HAM),
    ("farmer-d",    HAM),
    ("kaminski-v",  HAM),
    ("kitchen-l",   HAM),
    ("lokay-m",     HAM),
    ("williams-w3", HAM),
    ("BG",          SPAM),
    ("GP",          SPAM),
    ("SH",          SPAM)

II. Reading in the Sample Data

As in the original Python program, we need a function that can walk through the directories containing the sample emails, open each file, and extract the email bodies.

function readFile(path::String)
    for (root, dirs, files) in walkdir(path)
        for filename in files
            if !(filename in SKIPFILES) && filename[1] != '.'
                fullname = joinpath(root, filename)
                if isfile(fullname)
                    pastHeader, lines = false, Vector{String}()
                    open(fullname) do f
                        for line in eachline(f)
                            # Best guess for encoding of input that is not valid unicode is Latin-1
                            if !isvalid(line)
                                line = decode(convert(Array{UInt8,1}, line), "LATIN1")
                            line = chomp(line)
                            if pastHeader
                                push!(lines, line)
                            elseif endof(line) == 0
                                pastHeader = true
                    content = join(lines, NEWLINE)
                    produce((fullname, content))

The above function will operate as a ‘generator’ – the produce() function in Julia corresponds roughly with yield in Python.  Once initialised, each time it is executed (e.g. in a loop) it returns a new pair of values comprising the full name of a file, and the corresponding content.

Note also the decoding for ‘LATIN1’ text.  Julia is a ‘Unicode native’.  All text (including Julia source code) is assumed to be Unicode, encoded as UTF-8 by default.  The emails in the spam sample set are not.  For the most part, this does not matter much, but sometimes we receive a line that contains characters with codes greater than 127, which may not be valid UTF-8.  In this case, we simply assume (for want of a better option) that they are 8-bit Latin-1 text.  Without this conversion (which I concede is a bit ugly) an error is generated.

Next, we need functions to read all of that data into a table containing columns for the email content, its classification (‘spam’ or ‘not spam’) and, for reference, the source filename.  For convenience, I have split this into two functions.

function addData!(df::DataFrame, path::String, classification::String)
    for (filename, text) in Task(()->readFile(path))
        push!(df, @data([text, classification, filename]))

function buildDataSet(sources)
    df = DataFrame(text = Vector{String}(), class = Vector{String}(), index = Vector{String}())
    for (path, classification) in sources
        addData!(df, joinpath(SPAMROOT, path), classification)
    return df

The buildDataSet() function takes a list of directory names and corresponding content classifications (e.g. the SOURCES list defined above) and creates and populates the table (i.e. DataFrame df).  It does this by calling the addData!() function for each directory/classification in the list.  Note that the exclamation point used at the end of a function name is a Julia convention to indicate that the function has side effects, i.e. modifies one or more of its inputs.  In this case, the Dataframe df is modified by addData!().

You can also see the call to the generator function readFile() in the for loop in addData!().  In julia, a co-routine (i.e. our content generator) is explicitly initiated as a new task via the Task() function, which returns an object that can be used with the consume() function (or, in this case, as an iterator) to get each new result from the task.  If you want to understand this in more detail, I suggest reading the documentation.

Now, we do not want all of our data to remain in order, with all of the ‘spam’ and ‘not spam’ messages grouped together, because we will want to use random segments of the data for training and testing.  In the original Python, the pandas library reindex() function is used to shuffle the data.  Julia DataFrames do not have a corresponding function (at least, not that I could find), but there is a sort!() function which makes it simple enough to implement a random shuffle.

function dfShuffle!(df::DataFrame)
    # Shuffle the Dataframe by adding a random permutation column...
    df[:dfShuffle] = @data(randperm(size(df,1)))
    # ...then sorting on the list of random numbers...
    sort!(ds, cols = [order(:dfShuffle)])
    # ...then deleting the ordering column
    delete!(df, :dfShuffle)

We now have all the pieces we need to read in our sample data, and create a table ready for training a spam classifier!  Here is the code to do it.

if isfile(SPAMDATA)
    println("Loading data set...")
    ds = readtable(SPAMDATA)
    println("Building data set...")
    ds = buildDataSet(SOURCES)
    println("Writing data set...")
    writetable(SPAMDATA, ds)

Note that after first building the data set from the original source files I write it out in tabular form to a CSV file using the writetable() function.  On future runs, if this file exists it can simply be read using the corresponding readtable() function.  This is much faster than rebuilding the data set every time!

III. Building and Training a Spam Classifier

Here is my actual ML function, exCrossValidate(), which takes the complete data table as input, and divides it into a series of ‘folds’ (i.e. training and validation subsets).  For each fold, it uses the training subset to train the ML model (in this case, a multinomial na├»ve Bayes classifier) and then tests it against the validation subset, i.e. by feeding each validation sample into the model and comparing the resulting classification with the actual, known (‘truth’) classification.

function exCrossValidate(ds::DataFrame)
    pipeline = Pipeline([
        ("vectorizer",  CountVectorizer()),
        ("classifier",  MultinomialNB()) ])
    kFold = ScikitLearn.CrossValidation.KFold(size(ds, 1), n_folds = 2)
    scores = Vector{Float64}()
    # This creates a DataFrame with each row containing a count of the number of
    # records with each unique label, then extracts the labels into an array of strings
    classNames = convert(Array, by(ds, [:class], nrow)[:class])
    totalConfusion = zeros(2, 2)
    println("Evaluating model...")
    for (i, (trainIdx, cvIdx)) in enumerate(kFold)
        @printf("    Fold # %d...\n", i)
        trainText, trainClass = ds[:text][trainIdx], ds[:class][trainIdx]
        cvText, cvClass = ds[:text][cvIdx], ds[:class][cvIdx]
        # Force garbage collection if we have just replaced large arrays!
        if i != 1
        ScikitLearn.fit!(pipeline, trainText, trainClass)
        cvPred = ScikitLearn.predict(pipeline, cvText)
        totalConfusion += confusion_matrix(cvClass, cvPred, labels = classNames)
        score = f1_score(cvClass, cvPred, pos_label = SPAM)
        push!(scores, score)
        println(classification_report(cvClass, cvPred, labels = classNames))
    # Make a dataframe from the confusion matrix
    colNames = map(x -> convert(Symbol, x), classNames)
    confusionDf = convert(DataFrame, totalConfusion)
    names!(confusionDf, colNames)
    confusionDf[:truth] = classNames

    @printf("Total emails classified: %d\n", size(ds, 1))
    @printf("Average score: %.6f\n", mean(scores))
    @printf("Confusion matrix:\n")

Below is the result of running the exCrossValidate() function.  The results are, as expected, almost identical with those of the original Python code.  The trained classifier is right about 95% of the time, with a somewhat greater propensity to let spam through (‘false negatives’) than to incorrectly classify legitimate emails as spam (‘false positives’).

   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.5.2-pre+1 (2017-03-06 03:59 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit a6c55c5 (49 days old release-0.5)
|__/                   |  x86_64-linux-gnu

julia> include("spamology.jl")
Loading data set...

julia> exCrossValidate(ds)
Evaluating model...
    Fold # 1...
             precision    recall  f1-score   support

   NOT SPAM       0.87      0.99      0.92     11011
       SPAM       0.99      0.90      0.94     16652

avg / total       0.94      0.94      0.94     27663

    Fold # 2...
             precision    recall  f1-score   support

   NOT SPAM       0.86      0.99      0.92     10827
       SPAM       0.99      0.89      0.94     16836

avg / total       0.94      0.93      0.93     27663

Total emails classified: 55326
Average score: 0.942666
Confusion matrix:
2×3 DataFrames.DataFrame
│ Row │ NOT SPAM │ SPAM    │ truth      │
│ 1   │ 21654.0  │ 184.0   │ "NOT SPAM" │
│ 2   │ 3468.0   │ 30020.0 │ "SPAM"     │


IV. Parallel Potential

In his original Python article, Zac Stewart goes on to look at ways in which the ML model may be improved.  It is worth a read, but I am going to head off in a different direction.  Julia is supposed to be good for parallel computing, and there is an obvious opportunity to parallelise the above code.  Specifically, the cross-validation of each fold is independent, and so computing these sequentially in a single process, on a single core, underutilises the capacity of my multicore processor.  It is also a bit slow, so I wanted to see if I could speed it up with some multi-processing!

I will be covering that in the second part of this series.


Post a Comment