How good is my model?

Binary classification models are very common. The outcome is typically to classify records into one of two groups, e.g. Sick/Healthy, Will Buy/Won't Buy, Hot dog/Not hot dog.

This seems like such a simple scheme, but communicating the performance of such a model to a client is really hard to do.

The reason being, there are two types of mistakes your model can make:

  1. It can predict a record as positive when it isn’t.
  2. It can fail to predict a record as positive when it is.

This has messy consequences. Using a basic example below, I will try to develop this intuition and explain the standard terminology: sensitivity, specificity, precision and recall.

Example: Airport Screening

You are the agent in charge of screening passengers at an airport security checkpoint in a very dangerous country.

The detector will try to predict if a passenger is carrying something prohibited, like a weapon.

Let’s say we know that 5% of all passengers are carrying a weapon.

A good scanner would detect all passengers carrying weapons and let all unarmed passengers through freely.

Now we have a two-way table that compares what actually happened with what our detector has predicted. In this case our model is completely accurate and doesn’t make any mistakes.

|                     | Has Weapon | No Weapon |
|---------------------|------------|-----------|
| Predicted Weapon    | 5          | 0         |
| Predicted No Weapon | 0          | 95        |

However, in reality models aren’t perfect.

Type 1 Error - False Positives

This shows what happens when the scanner beeps for an innocent passenger. Maybe they just have keys in their pocket.

|                     | Has Weapon | No Weapon |
|---------------------|------------|-----------|
| Predicted Weapon    | 5          | 3         |
| Predicted No Weapon | 0          | 92        |

Type 2 Error - False Negatives

This is when the detector doesn’t beep and the passenger with a gun sneaks through. Maybe it’s hidden well or is made of a unique material and fools the detector.

|                     | Has Weapon | No Weapon |
|---------------------|------------|-----------|
| Predicted Weapon    | 4          | 0         |
| Predicted No Weapon | 1          | 95        |

The complete picture

|                     | Has Weapon | No Weapon |
|---------------------|------------|-----------|
| Predicted Weapon    | 4          | 3         |
| Predicted No Weapon | 1          | 92        |
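A table like this can be tallied directly from paired labels and predictions. Here is a minimal sketch in plain Python, using made-up data that reproduces the counts above (1 = has weapon, 0 = no weapon):

```python
# Illustrative data: 5 armed passengers (4 detected, 1 missed)
# and 95 unarmed passengers (3 falsely flagged, 92 passed).
actual    = [1, 1, 1, 1, 1] + [0] * 95
predicted = [1, 1, 1, 1, 0] + [1, 1, 1] + [0] * 92

# Tally the four cells of the two-way table.
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # true positives
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # false negatives
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # true negatives

print(tp, fp, fn, tn)  # 4 3 1 92
```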

How correct is the model?

Sensitivity (Recall)

This is the ability to correctly detect passengers who are carrying a weapon.

We have 5 passengers who are carrying a weapon, and we detected 4 of them, but 1 slipped through.

Sensitivity = 4/5 = 0.8


Specificity

This is the ability to correctly pass passengers who are not carrying a weapon.

We have 95 passengers who are not carrying a weapon. We let 92 go through, but stopped 3 innocent passengers to double check.

Specificity = 92/95 = 0.968


Accuracy

We can combine the number of true positives (4) and the number of true negatives (92) to provide a view of how good our model is at being right.

Accuracy = (4 + 92) / 100 = 0.96

Positive Predictive Value (Precision)

This is the proportion of passengers we stopped who did have a weapon.

PPV = 4 / 7 = 0.57
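All four metrics fall straight out of the four cells of the table. A quick sketch in plain Python (the names tp, fp, fn, tn are my own shorthand for the cells):

```python
# Cells of the complete two-way table above.
tp, fp, fn, tn = 4, 3, 1, 92

sensitivity = tp / (tp + fn)                    # 4 / 5   = 0.8
specificity = tn / (tn + fp)                    # 92 / 95 ≈ 0.968
accuracy    = (tp + tn) / (tp + fp + fn + tn)   # 96 / 100 = 0.96
ppv         = tp / (tp + fp)                    # 4 / 7   ≈ 0.571

print(f"{sensitivity:.3f} {specificity:.3f} {accuracy:.3f} {ppv:.3f}")
# 0.800 0.968 0.960 0.571
```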


Don't look at numbers in isolation

Hang on. We have high specificity, high sensitivity and very good accuracy, but when the detector beeps and we stop someone, we are only right 57% of the time. Why?

If the thing you are predicting is rare, you will have lots of negative cases. Even a very low false positive rate applied across that many cases will still produce a large absolute number of false positives.

So even if you are good at predicting the positive case, the relative number of true positives to false positives may still be rubbish.
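A small sketch makes this concrete. Holding sensitivity and specificity fixed at the values above and varying only how common the positive class is, precision collapses as positives get rarer (the prevalence values below are illustrative):

```python
# Fixed detector quality, varying prevalence of the positive class.
sensitivity, specificity = 0.8, 0.968

for prevalence in (0.5, 0.05, 0.001):
    # Expected fractions of the population in each cell.
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    ppv = tp / (tp + fp)
    print(f"prevalence {prevalence:>5}: PPV = {ppv:.2f}")

# prevalence   0.5: PPV = 0.96
# prevalence  0.05: PPV = 0.57
# prevalence 0.001: PPV = 0.02
```

At 5% prevalence this recovers the 57% figure from the airport example; at 1-in-1000, the same detector is almost useless on a beep alone.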

In a medical scenario, this is common with screening tests. A negative result may reassure a patient they don't have a disease, but a positive result can't be relied on by itself; instead, it triggers a more thorough follow-up test that may be too invasive or expensive to apply to everyone.

In our airport scenario, this is the pat down and wand technique. Too annoying and time consuming to do to everyone, but the detector is not predictive enough to just blindly arrest everyone who makes the machine beep.

Trade Offs

It's also important to look at the trade-offs that suit your use case. In the airport example, it's worth accepting a higher false positive rate and inconveniencing more people in order to be extra sure you are not missing anyone carrying a gun onto a plane. This increases recall but reduces precision.

If you run a fishing trawler, it would be more responsible to use smaller nets and accept a smaller catch of fish to decrease the chance of accidentally netting a dolphin. Here you accept lower recall in exchange for higher precision.
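In practice this trade-off is usually made by moving a decision threshold: most classifiers output a score, and where you draw the line between "positive" and "negative" swaps precision for recall. A sketch with made-up scores:

```python
# Hypothetical classifier scores for 5 actual positives and 6 actual negatives.
scores = [0.9, 0.8, 0.7, 0.4, 0.3,        # actual positives
          0.6, 0.5, 0.2, 0.2, 0.1, 0.1]   # actual negatives
labels = [1] * 5 + [0] * 6

def precision_recall(threshold):
    """Precision and recall when flagging every score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return tp / (tp + fp), tp / (tp + fn)

for t in (0.65, 0.25):
    p, r = precision_recall(t)
    print(f"threshold {t}: precision {p:.2f}, recall {r:.2f}")

# threshold 0.65: precision 1.00, recall 0.60
# threshold 0.25: precision 0.71, recall 1.00
```

Lowering the threshold is the airport choice (catch everyone, annoy more innocents); raising it is the trawler choice (miss some fish, spare the dolphins).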