Updated: Sep 14, 2021
In Star Wars Episode IV: A New Hope, Obi-Wan Kenobi begins training Luke Skywalker using a floating droid equipped with tiny lasers. The droid floats through the air and fires at Skywalker, who must defend himself using a lightsaber – blindfolded. He, predictably, fails badly. Obi-Wan Kenobi then encourages him to regain his focus and to use “The Force” to feel the droid’s movements and predict the threat. Only then can Skywalker successfully navigate the exercise.
Those of us who have not yet attained “Jedi Master” status cannot count on ever having the same skill as Luke Skywalker. We have no choice but to take in signals from the tangible world around us, evaluate them, and make decisions. The complexity of the world often makes it feel as though we are blindfolded, though. Statistical techniques have long been useful in pulling back that blindfold. Here we summarize a few of the concepts that may help leaders gain new insights from the data around them – and to help those same leaders perform more like Jedi masters.
For this discussion we will distinguish between two main types of problem: regression and classification. We distinguish the two based on whether we are trying to understand a relationship (how one thing changes relative to another – regression) or whether we are trying to predict whether something belongs in one group or another (how likely it is that Thing A belongs to Group 1 – classification). If you are trying to find statistical relationships, you are likely solving one of these classes of problem.
Regression problems are likely familiar to most people in the form of the basic linear regression, or least squares regression. This technique attempts to fit a line to predict one parameter as a function of another (or several others for multiple linear regression). Think of height and weight.
With some fabricated data (I produced it with a random number generator, varying around some values that made sense), we can easily show a linear regression of height and weight using Microsoft Excel. The regression equation is displayed on the chart and an R^2 value of 0.254 tells us there is some correlation, but it is not very strong.
While regression is certainly an important topic, we are focusing on classification here. Imagine that, instead of trying to determine what weight we can expect based on a specific height, we were trying to determine whether a person of a certain height and weight was likely to shop at a Big & Tall store? This is a classification problem and is likely encountered just as frequently as regression in business situations.
To keep things relatively simple, we will closely examine one common classification algorithm – Logistic Regression (do not let the “regression” in the name fool you). Logistic Regression may be incorporated as the underlying classification algorithm in much more sophisticated machine learning systems – wherein the algorithm is used to continuously improve in some predictive process. To demonstrate its use though, no complex new software build is required.
Your Alien Friend
Imagine you have been visiting an alien planet for a long time. While there you break your ankle and cannot get out of bed. While you’re laying around, your alien friend decides to take a hyperspace trip to Earth to bring a few reminders of home back to you – to set the ambiance as you recover. He is going to build you a garden with tranquil waterfalls, cascading over stone structures. But he does not know how to identify rocks – much less pick them out of a forest as he searches Earth. He would be liable to confuse rocks with all manner of other things on the forest floor – sticks and leaves even. Some parameters that make sense to him, and that he can differentiate, might be an object’s color, density, and hardness. Describing the shape of a leaf or a stick or a rock to him is not much help – he has no context at all. So, you send him on his way to Earth with a few samples that you have identified for him and a mission to collect rocks.
To study this example, we have created a random set of rocks and leaves. When we plot density and hardness (imaginary, arbitrary scales), the two appear to be easily distinguished:
Your alien friend will have a pretty easy time telling the difference if rocks and leaves are the only things he encounters. If we add sticks to the mix, it gets more difficult for him:
There is a range of density and hardness where your alien friend will be unsure whether he is looking at a leaf, a stick, or a rock. Evaluating each object systematically will need a robust approach. Since your friend is not able to tell the difference himself, he needs a classifier model if he is to succeed in bringing back rocks.
Logistic Regression is a commonly used classifier model. Though the name includes “regression” it is most often used as a means of determining the likelihood that a set of inputs corresponds to a certain class of output. In your alien friend’s example, logistic regression can help us determine the likelihood that an object (with a certain density and hardness) is actually a rock. Logistic regression is still a regression in the sense that it models a relationship between the characteristics of interest (density, hardness) and a certain result (is a rock). The regression achieves this by using the logistic function, which models the probability of the object being a rock. For the purposes of this discussion, we are leaving out the math and focusing on the concepts, but our reference at the end (as well as innumerable free online resources) can give you all the math details you may need.
Training Data and Test Data
So, if your alien friend has collected thousands of objects and he has some samples from you (for which you have told him whether the thing is a rock, a stick, or a leaf) what does he do? First, he selects your samples and uses an open source software package to build a regression model – the model gets the density and hardness on one side (inputs) and adjusts regression coefficients to get the best possible match to the pre-determined classification you have provided (the response variable). Your alien friend has used your sample as a training data set. In our example model, we have used one half of the randomly generated data as a training set. Below are some results from our training data, in a model where we demanded that the probability of something being a rock be at least 0.80 before we actually call it a rock.
The chart above is called a confusion matrix for the training data set. Do not allow the confusion matrix to confuse you. It is actually a simple means of assessing whether the model performs well. The values in each quadrant of the confusion matrix have been color-coded to show what the interpretation of each is. Here we see that the model is not giving us very good results. The high probability (0.80) tends to make the model default to “Not a Rock” for almost everything. How can we improve the model? Lowering the probability does not actually help. Retraining the model with a probability of 0.50 gives us results that make it appear the model is merely guessing at random.
Apparently, density and hardness are not sufficiently different to allow them to act as reliable predictors for whether an object is a rock. We have three classes mixed into this data set and the sticks have an intermediate range of both density and hardness, making it hard to differentiate anything. Maybe there is another characteristic our alien friend will be able to recognize. It turns out that we can use categorical variables in logistic regression also. The density and hardness of the object are continuous variables (the object could take on any value), but the object’s shape or color is a categorical variable (takes on discrete values and is not represented by a number). For this example, our rocks, sticks, and leaves took on colors randomly according to the lists below (it must be Autumn).
Including color in the model improves performance drastically.
Then, to determine whether we really have a good model, we run the classifier on the Test data set (the other half of our sample for which we know the real classification). For this model, the same performance is repeated in the test data set, demonstrating that the model is a good classifier. The confusion matrix for the Test dataset is shown below, with similar concentrations in the True Positive and True Negative quadrants.
Beyond the visual sense of model performance we get from the confusion matrix, there are other means of quantifying the model’s ability to accurately classify (often using ratios of numbers appearing in the confusion matrix), but for the sake of brevity we will save those for another day. By providing the hardness, density, and color for a group of things we know and understand, we have built a model that can tell us, pretty reliably, whether an unknown object is the thing we are looking for (rocks!) and it did not take any special software product to do it.
If you have read this far then hopefully you have gained an appreciation for:
1. What a classifier model is and how it differs from what is likely your most common prior experience with statistical modeling, linear regression
2. What logistic regression can achieve in terms of classification, and special features that make it very useful (like the ability to handle categorical variables, and results which are easily interpreted as probabilities)
3. How to interpret outputs of logistic regression (i.e., the confusion matrix) and an appreciation for how the inclusion of more input data (in our example, object color) can improve the performance evident in those outputs
If you have not already thought ahead to business cases that could benefit from classifier models like this one, here are some examples of potential opportunities:
In mineral resources, predicting whether a certain region or geologic formation will be stable or will contain a certain type of resource
In marketing, predicting whether a customer population in one brand/market would also be attracted to another brand/market
In human resources, predicting whether an employee belongs to a class of particular interest, for instance, those who are expected to attrite in the near term
With a little thought, you can probably come up with places in your business that might benefit from the application of a classifier model. This Ox Road Observation was titled Classification Models – Enough to be Effective (instead of Dangerous, like the common figure of speech) because the most effective classification models may not require extraordinary complexity (or cost), but can still be effective. Though it takes a degree of familiarity with the algorithms and the mathematics behind them, a full-blown data science team working around the clock is not required for most problems. And, admittedly, by producing synthetic data here we have ensured that the pictures are clear and the model performs well...but that helps to illustrate the point, so we hope you will forgive us.
There are other classifier algorithms worth exploring, and many other topics of interest in statistical learning. I recommend An Introduction to Statistical Learning with Applications in R, by James, Witten, Hastie, and Tibshirani (ISBN 978-1461471370) for anyone who wants to dig into these concepts. It is a relatively simple way to get acquainted with statistical learning. And though you may not be capable of producing a classifier model yourself, this introduction may give you enough confidence to understand where these techniques can give you an advantage, and how to recognize where you need help.