Even better! At first, this seems like a major triumph, but there is a hidden danger here. Note that we are training the line to fit X and y and then asking it to predict X again! This is a serious mistake in ML because it can lead to overfitting, which means that our model is merely memorizing the data and regurgitating it back to us.
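To see the pitfall concretely, here is a minimal sketch. The synthetic data, and the choice of a decision tree instead of the fitted line from our example, are assumptions made purely for illustration; a deep tree can memorize every training point, which makes the problem easy to spot:

```python
# Hypothetical illustration: fit a model on X and y, then ask it to
# predict X again. An unrestricted decision tree can memorize every
# point, so the in-sample score looks perfect and tells us nothing useful.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))           # one noisy feature
y = 3 * X.ravel() + rng.normal(0, 2, size=100)  # linear signal plus noise

tree = DecisionTreeRegressor()  # no depth limit: can fit every point exactly
tree.fit(X, y)                  # train on X and y ...
print(tree.score(X, y))         # ... then score on the same X: R^2 of 1.0
```

A perfect score on data the model has already seen is not evidence of a good model; it is evidence of a good memory.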
Imagine that you are a student, and you walk into the first day of class, where the teacher tells you that the final exam in this class is very difficult. To prepare you, she gives you practice test after practice test after practice test. The day of the final exam arrives, and you are shocked to find that every question on the exam is exactly the same as on the practice tests! Luckily, you did them so many times that you remember the answers and score 100% on the exam.
The same thing applies here, more or less. By fitting and predicting on the same data, the model looks accurate only because it has already seen the answers; it is memorizing rather than learning. A great way to combat this overfitting problem is to use the train/test split approach to fitting ML models, which works as illustrated next.
Essentially, we will take the following steps (sketched in code right after this list):
- Split up the dataset into two parts: a training and a test set.
- Fit our model on the training set and then test it on the test set, just like in school, where the teacher would teach from one set of notes and then test us on different (but similar) questions.
Once our model is good enough (based on our metrics), we refit it on the entire dataset so that it learns from all of the available data.
- Our model awaits new data previously unseen by anyone.
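Here is a minimal sketch of these steps using scikit-learn's train_test_split; the synthetic data, the linear model, and the 30% test size are assumptions of this illustration, not requirements:

```python
# A sketch of the train/test workflow described above (illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(0, 2, size=200)

# Step 1: split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Step 2: fit on the training set, then test on the held-out test set
model = LinearRegression()
model.fit(X_train, y_train)
print("test R^2:", model.score(X_test, y_test))

# Step 3: once the test metric looks good enough, refit on the entire dataset
model.fit(X, y)

# Step 4: the refit model now awaits truly unseen data, via model.predict(...)
```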
This workflow can be visualized in Figure 10.10:

Figure 10.10 – Splitting our data up into a training and testing set helps us properly evaluate our model’s ability to predict unseen data
The goal here is to minimize our model's out-of-sample errors, which are the errors it makes on data it has never seen before. This is important because the main idea (usually) of a supervised model is to predict outcomes for new data. If our model cannot generalize from the training data and use what it learned to predict unseen cases, then it isn't very good.
The preceding diagram outlines a simple way of ensuring that our model can effectively learn from the training data and use that learning to predict data points that it has never seen. Of course, as data scientists, we know that the test set also has answers attached to it, but the model doesn't know that.
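As a final sketch (again with assumed synthetic data and an assumed decision tree model), we can put a number on the gap between in-sample and out-of-sample error, which is exactly the gap that the train/test split exposes:

```python
# Compare in-sample error (data the model has seen) with out-of-sample
# error (held-out data). Illustrative data and model choice, as above.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

tree = DecisionTreeRegressor().fit(X_train, y_train)
in_sample = mean_squared_error(y_train, tree.predict(X_train))    # ~0: memorized
out_of_sample = mean_squared_error(y_test, tree.predict(X_test))  # much larger
print(f"in-sample MSE:     {in_sample:.2f}")
print(f"out-of-sample MSE: {out_of_sample:.2f}")
```

A tiny in-sample error paired with a large out-of-sample error is the signature of overfitting; the test set is what lets us catch it before the model meets real unseen data.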