When applying a new model to perform some form of predictive modelling, it’s often useful to gain some intuition around how it works.
A great way to do this is by visualising the model’s decision boundaries.
Let’s start with a contrived example: a binary classification problem with two covariates, where we want to classify points as belonging to the inner or outer ring of the chart below.
A simple way to visualise a decision boundary is to train our model using the above data and then make predictions over a grid of points that spans the range of the independent variables. By colour-coding the predicted class we can get a rough idea of the classification boundary of our chosen model.
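The grid approach can be sketched in a few lines. The snippet below is a minimal, standard-library Python illustration rather than the R code behind the charts in this post: `make_rings`, `make_grid`, `predict_grid`, `render` and `toy_model` are hypothetical helper names, and the ring radii and noise level are assumed values chosen to mimic the data.

```python
import math
import random


def make_rings(n_per_ring=100, seed=1):
    """Simulate two noisy concentric rings (radii 1 and 3 are assumed values)."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, radius in (("inner", 1.0), ("outer", 3.0)):
        for _ in range(n_per_ring):
            angle = rng.uniform(0.0, 2.0 * math.pi)
            r = radius + rng.gauss(0.0, 0.1)
            points.append((r * math.cos(angle), r * math.sin(angle)))
            labels.append(label)
    return points, labels


def make_grid(points, steps=25):
    """Return steps x steps points spanning the range of both covariates."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]

    def axis(lo, hi):
        return [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]

    return [(x, y) for y in axis(min(ys), max(ys)) for x in axis(min(xs), max(xs))]


def predict_grid(predict, grid):
    """Apply any point -> label function to every grid point."""
    return [predict(point) for point in grid]


def render(grid_labels, steps=25):
    """Draw the predicted classes as text, one character per grid point."""
    symbols = {"inner": "#", "outer": "."}
    rows = [grid_labels[i:i + steps] for i in range(0, len(grid_labels), steps)]
    # The grid is built bottom-up, so reverse the rows to put large y at the top.
    return "\n".join("".join(symbols[label] for label in row) for row in reversed(rows))


def toy_model(point):
    """Placeholder for a fitted model: 'inner' within distance 2 of the origin."""
    return "inner" if math.hypot(point[0], point[1]) < 2.0 else "outer"


points, labels = make_rings()
grid = make_grid(points)
print(render(predict_grid(toy_model, grid)))
```

Any fitted model can be dropped in for `toy_model`, as long as it maps a point to a class label. The printed block of characters plays the role of the colour-coded chart.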
Our first model is glm, which is a generalised linear model. We are using the caret package here, which will use the binomial family when it detects a categorical dependent variable, so essentially it’s a logistic regression.
We can see it has fit a linear decision boundary, which is what we would expect given that both predictors enter the model linearly. Obviously it’s not well suited to this problem.
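To see why the boundary comes out straight: logistic regression classifies a point by the sign of a linear score b + w1*x1 + w2*x2, so the boundary is the line where that score equals zero. Below is a minimal sketch that fits the model by plain gradient descent in standard-library Python. This is not how R's glm fits the model (it uses iteratively reweighted least squares), and the simulated rings, learning rate and number of steps are all assumptions.

```python
import math
import random

rng = random.Random(1)


def ring(radius, n):
    """n noisy points on a circle of the given radius (simulated data)."""
    out = []
    for _ in range(n):
        a = rng.uniform(0.0, 2.0 * math.pi)
        r = radius + rng.gauss(0.0, 0.1)
        out.append((r * math.cos(a), r * math.sin(a)))
    return out


X = ring(1.0, 100) + ring(3.0, 100)
y = [1] * 100 + [0] * 100  # 1 = inner ring, 0 = outer ring


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, rate=0.1, steps=500):
    """Gradient descent on the mean log-loss of P(inner) = sigmoid(b + w1*x1 + w2*x2)."""
    b = w1 = w2 = 0.0
    n = len(X)
    for _ in range(steps):
        gb = g1 = g2 = 0.0
        for (x1, x2), target in zip(X, y):
            err = sigmoid(b + w1 * x1 + w2 * x2) - target
            gb += err
            g1 += err * x1
            g2 += err * x2
        b -= rate * gb / n
        w1 -= rate * g1 / n
        w2 -= rate * g2 / n
    return b, w1, w2


b, w1, w2 = fit_logistic(X, y)
# The predicted class flips exactly where the linear score crosses zero.
predicted = [1 if b + w1 * x1 + w2 * x2 > 0 else 0 for x1, x2 in X]
accuracy = sum(p == t for p, t in zip(predicted, y)) / len(y)
print(f"boundary: {b:.3f} + {w1:.3f}*x1 + {w2:.3f}*x2 = 0, accuracy = {accuracy:.2f}")
```

No straight line can separate a ring from the ring that surrounds it, so whatever coefficients the fit lands on, a large share of the points ends up on the wrong side.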
CART Decision Tree
Next we try rpart, which implements the CART model, which is a basic decision tree. The aim of tree-based models is to partition the data into homogeneous groups using classification ‘rules’ or ‘splits’. It is clear from the above chart the model has created a set of if-then rules to isolate the inner ring data points.
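To make the if-then idea concrete, a fitted tree on ring-shaped data amounts to a chain of threshold tests on one covariate at a time. The rules below are hand-written in Python purely for illustration, with an assumed cut-off of 1.5 on each covariate. They are not the splits rpart actually chose, and `classify` is a hypothetical name.

```python
def classify(x1, x2, cut=1.5):
    """Hand-written stand-in for a fitted tree: four splits carve out a box."""
    if x1 < -cut:
        return "outer"
    if x1 > cut:
        return "outer"
    if x2 < -cut:
        return "outer"
    if x2 > cut:
        return "outer"
    # Every split has been passed, so the point sits inside the central box.
    return "inner"


print(classify(0.2, -0.4))  # passes all four tests
print(classify(2.8, 0.1))   # caught by the second rule
```

Because each rule compares a single covariate with a threshold, the regions such a tree can draw are built from horizontal and vertical edges only.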
If we relax the complexity parameter (cp) for this model we can force the decision tree to keep creating more and more splits, even when each additional split barely improves the fit.
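The effect of such a threshold can be sketched with a toy tree grower. In this standard-library Python sketch, `min_gain` is a hypothetical, simplified stand-in for rpart's cp: a node is split only when the best available split reduces its Gini impurity by at least `min_gain`. rpart defines and applies cp differently, so treat this as an analogy rather than a reimplementation. The ring data is simulated with assumed radii and noise.

```python
import math
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(X, y):
    """Return (gain, feature, threshold) for the best single axis-aligned split."""
    best = (0.0, None, None)
    parent = gini(y)
    for feature in (0, 1):
        values = sorted(set(x[feature] for x in X))
        # Using observed values (skipping the smallest) keeps both sides non-empty.
        for threshold in values[1:]:
            left = [t for x, t in zip(X, y) if x[feature] < threshold]
            right = [t for x, t in zip(X, y) if x[feature] >= threshold]
            child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if parent - child > best[0]:
                best = (parent - child, feature, threshold)
    return best


def count_leaves(X, y, min_gain):
    """Grow a tree greedily and count its leaves; split only if gain >= min_gain."""
    gain, feature, threshold = best_split(X, y)
    if feature is None or gain < min_gain:
        return 1
    left = [(x, t) for x, t in zip(X, y) if x[feature] < threshold]
    right = [(x, t) for x, t in zip(X, y) if x[feature] >= threshold]
    return sum(count_leaves([x for x, _ in part], [t for _, t in part], min_gain)
               for part in (left, right))


# Simulated rings: label 1 = inner, 0 = outer.
rng = random.Random(1)
X, y = [], []
for label, radius in ((1, 1.0), (0, 3.0)):
    for _ in range(100):
        a = rng.uniform(0.0, 2.0 * math.pi)
        r = radius + rng.gauss(0.0, 0.1)
        X.append((r * math.cos(a), r * math.sin(a)))
        y.append(label)

for min_gain in (0.1, 0.01, 0.0):
    print(min_gain, count_leaves(X, y, min_gain))
```

Lowering the threshold can only add leaves, never remove them, because every split accepted under a strict threshold is also accepted under a looser one.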
Random Forest models are a type of ensemble model built on a concept called bagging (bootstrap aggregating). A random forest fits a number of decision trees, each on a different bootstrap sample of the training data and with a random subset of the predictors considered at each split, and aggregates their predictions using a majority vote. As a result, instead of a crisp decision boundary we get a tighter, more jagged boundary, which matches our intuition.
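The two ingredients, resampling and voting, can be sketched in standard-library Python. This is a deliberately stripped-down illustration and not the random forest algorithm itself: each ensemble member here is a single-split stump on one randomly chosen covariate, whereas a real random forest grows deep trees. All function names and the tiny data set are made up for the example.

```python
import random
from collections import Counter


def bootstrap_sample(X, y, rng):
    """Draw len(X) rows with replacement."""
    idx = [rng.randrange(len(X)) for _ in X]
    return [X[i] for i in idx], [y[i] for i in idx]


def fit_stump(X, y, feature):
    """Depth-one tree on a single feature: pick the threshold with fewest errors."""
    best = None
    for threshold in sorted(set(x[feature] for x in X)):
        left = [t for x, t in zip(X, y) if x[feature] < threshold]
        right = [t for x, t in zip(X, y) if x[feature] >= threshold]
        # right is never empty because the threshold is an observed value.
        right_label = Counter(right).most_common(1)[0][0]
        left_label = Counter(left).most_common(1)[0][0] if left else right_label
        errors = sum(t != left_label for t in left) + sum(t != right_label for t in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x[feature] < threshold else right_label


def fit_bagged(X, y, n_models=25, seed=1):
    """Fit each stump on its own bootstrap sample and a randomly chosen feature."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        Xb, yb = bootstrap_sample(X, y, rng)
        models.append(fit_stump(Xb, yb, feature=rng.randrange(len(X[0]))))
    return models


def predict_bagged(models, x):
    """Majority vote across the ensemble."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Tiny made-up data set: class "b" for large values of the first covariate.
X = [(0.1, 0.9), (0.3, 0.2), (0.4, 0.7), (0.6, 0.1), (0.8, 0.8), (0.9, 0.3)]
y = ["a", "a", "a", "b", "b", "b"]
models = fit_bagged(X, y)
print(predict_bagged(models, (0.2, 0.5)), predict_bagged(models, (0.85, 0.5)))
```

Each member sees slightly different data, so each draws its split in a slightly different place. Averaging many such votes is what roughens the single crisp boundary of one tree.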
A key concept here is the trade-off between bias and variance. A more complex model may be more ‘accurate’ on the training data but will typically have higher variance. This makes it vulnerable to ‘over-fitting’, meaning it is fitted too closely to the specific training data used and may not generalise well to new data. On the other hand, a basic model (like the CART model) may have less variance and be more generalisable, but will have more bias, as its over-simplification makes it less precise.
Using these visualisations is a good way to understand a model, but it’s easy to get distracted by how well it fits the training data and get sucked into the over-fitting trap.
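One way to guard against that trap is to score the model on data it has not seen. The sketch below, in standard-library Python, uses a deliberately extreme over-fitter: a hypothetical ‘memoriser’ that stores every training point and falls back to the most common class for anything new. The data is simulated and every name is made up for the example.

```python
import random
from collections import Counter


def train_test_split(X, y, test_fraction=0.3, seed=1):
    """Shuffle the row indices and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    n_test = round(len(idx) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])


def fit_memoriser(X, y):
    """Deliberately over-fitted model: memorise every training point."""
    lookup = dict(zip(X, y))
    fallback = Counter(y).most_common(1)[0][0]
    return lambda x: lookup.get(x, fallback)


def accuracy(model, X, y):
    """Share of points whose predicted label matches the true label."""
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)


# Simulated data: points in a square, labelled by distance from the centre.
rng = random.Random(2)
X = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
y = ["inner" if x1 * x1 + x2 * x2 < 0.5 else "outer" for x1, x2 in X]

X_train, y_train, X_test, y_test = train_test_split(X, y)
model = fit_memoriser(X_train, y_train)
print("train accuracy:", accuracy(model, X_train, y_train))
print("test accuracy: ", accuracy(model, X_test, y_test))
```

The memoriser fits the training data perfectly, yet it has learned nothing about the shape of the classes. Only the held-out score reveals that.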
- The source code for this post is available from my GitHub
- Kuhn & Johnson (2013), Applied Predictive Modeling, Springer