How to train neural networks for image classification — Part 1
Image classification is a hot topic in data science with the past few years seeing huge improvements in many areas. It has a lot of applications everywhere, but how is this done?.
A deep neural network is a network of artificial neurons organised into layers (via software). With each layer connected to the next and each connection having a weight that helps determine how much the artificial neuron fires. This firing helps determine how strong the connections are between layers, and in general neurons that fire together have stronger connections. Just like with biological neurons.
How strong these connections are is determined by how the network is trained on the data you put into it. These networks are trained via a process called backpropagation which works to feed data into the network and then measures the network’s performance. This error is measured using a loss function.
Backpropagation works by using gradient descent to measure the rate-of-change of the loss function with respect to the weighting of each connection, and the gradient descent step is used to make sure the error rate for each connection is reduced as close to zero as possible. The network should eventually converge on a solution where the overall error is minimised.
The rate at which the network learns is called the learning rate and this is another hyperparameter that can be tuned when training neural networks. If the learning rate is too small, the network can take too long to converge on a solution, and conversely, if the learning rate is too large then the network will ‘bounce around’ and never really converge on an optimum solution.
There are different types of layers in neural networks, and each one transforms data differently. The most basic type is a dense layer, which is where all neurons are connected. Other types include convolutional layers which are primarily used for image processing tasks and recurrent layers which are used to process time-series data. There are others, but these are the most common types.
In this article, I will be focusing on how to implement a simple image classifier using a series of dense layers in Python using Keras as part of Tensorflow. As mentioned above, convolutional neural networks usually work better for image classification tasks and I will talk about these in part 2 of this series. As my primary area of interest is Search Engine Optimisation, I will tie all of this together in part 3 for how neural networks are used in search.
Neural networks are fascinating, and if you have an interest in this topic, I would encourage you to check out this excellent playlist on YouTube on a channel called Deep Lizard. They have excellent and accessible content on neural networks and deep learning.
If you are more interested in the implementation using Python with Keras, I would encourage you to look at Hands-on Machine Learning with Scikit-Learn, Keras and Tensorflow by Aurelion Geron. It is an excellent book, written by a former Googler and reviewed by the author of Keras. I highly recommend it.
Getting started with Keras and TensorFlow
Keras is a high-level deep learning API in Python that allows you to easily create and train deep learning models. It was launched in 2016 and has gained traction within the data science community due to its ease-of-use as its syntax is designed to be very similar to Scikit-Learn.
TensorFlow is a library created by the Google Brain team for machine learning tasks and has often competed with PyTorch (created by Facebook) for market share, but has not been able to do as well as its documentation wasn’t as accessible. To remedy this, Google released TensorFlow 2 which contained many improvements particularly around cross-compatibility with models, GPU support and graphing utilities.
With the release of TensorFlow 2, Keras is now merged into Tensorflow and the standalone version is no longer maintained. If you want to install Keras and TensorFlow then it is straightforward. Just go to your Python environment (I recommend using a virtual environment/package manager like pipenv) and use whichever Python package manager you like:
python -m pip install TensorFlow
Be sure to use a 64-bit version of Python for this to work. To get GPU support on Windows is a bit of a faff, but this YouTube video will show you how to get it done. However, you probably won’t need it for this tutorial. You could look at Google Colab as that has GPUs available with no setup.
You can check the installation in a Jupyter notebook with the following:
import TensorFlow as tf
from tensorflow import keras
print(tf.__version__)
print(keras.__version__)
If this has all worked, then at the time of writing you should see something like:
2.3.0
2.4.0
When you’ve got Keras and TensorFlow working, you should be good to go on building an image classifier with a neural network.
Importing the data set
For most simple image classification tasks, it is popular to use the MNIST data set, which consists of 60,000 photos of handwritten numbers. However, for this task, we are going to use the MNIST Fashion dataset, which consists of 60,000 28 x 28 grayscale images of Zalando article fashion images, all classified across 10 different classes. The reason for this is that image classifiers tend to find this more challenging.
Keras has utility functions to help import this dataset, so it is fairly straightforward to use (similar to Sklearn). Work in a Jupyter notebook, and begin by making sure we have all the imports we need:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
We will be working with NumPy arrays and plotting this with Matplotlib, so you will need to ensure these are accessible within your environment. Once this is done, you can import the dataset:
fashion_mnist = keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()
The dataset is split between a training and a test set automatically (60,000 images in training, 10,000 images in test). The x-axis data are the images and the y-axis data are the labels. To make this more useful for working with, it is also a good idea to create a validation data set so we can ensure the model isn’t overfitting:
X_valid, X_train = X_train_full[:5000] X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
The y-axis data is just a series of numbers associated with each class label, therefore we need to define the class labels manually:
class_names = [ “T-shirt/top” , “Trouser” , “Pullover” , “Dress” , “Coat” , “Sandal” , “Shirt” , “Sneaker” , “Bag” , “Ankle boot” ]
To get an idea of what the dataset actually represents we can use a simple loop and Matplotlib:
plt.figure(figsize=(10,10))
for i in range(25):
plt.subplot(5,5,i+1)
plt.xticks([])
plt.yticks([])
plt.grid(False)
plt.imshow(X_train_full[i], cmap=plt.cm.binary)
plt.xlabel(class_names[y_train_full[i]])
plt.show()
From this, you will see something like this:
As you can, while there are only 10 classes (similar to the MNIST dataset), the images are different for each class, which is why it is a more challenging dataset to work with.
Normalizing the dataset
The first step to working with neural networks is to normalize the dataset, otherwise, it could take a lot longer for the network to converge on a solution.
The usual way of normalizing a dataset is to scale the features, and this is done by substacting the mean from each feature and dividing by the standard deviation. This will put the features on the same scale somewhere between 0 — 1.
As we are working with 28 x 28 NumPy arrays representing each image and each pixel in the array has an intensity somewhere between 1 — 255, a simpler way of getting all of these images on a scale between 0–1 is to divide each array by 255.
X_valid, X_train = X_valid / 255., X_train / 255.
X_test = X_test / 255.
From there, we are good to go to build a dense layer neural network and train it on our data set.
Building the neural network image classifier
In order to build the model, we have to specify its structure using Keras’ syntax. As mentioned above, it is very similar to Scikit-Learn and so it should be recognisable if you are familiar with that package. The code for building the model is as follows:
model = keras.models.Sequential([keras.layers.Flatten(input_shape = [28, 28]),
keras.layers.Dense(300, activation = “relu” ),
keras.layers.Dense(100, activation = “relu” ),
keras.layers.Dense(100, activation = “relu” ),
keras.layers.Dense(100, activation = “relu” ),
keras.layers.Dense(10, activation = “softmax” )])
To explain this code:
- The first line creates a Sequential model and this is the simplest type of data structure in Keras and is basically a sequence of connected layers in a network
- The first layer in the model is a flatten layer and is there for pre-processing of the data and it isn’t trainable itself. What this does is take each 28 x 28 NumPy array for each image and flattens it into a 1 x 784 array that the network can work with
- Next, we add a Dense hidden layer with 300 neurons. It will use the ReLU activation function. Each Dense layer manages its own weight matrix, containing all the connection weights between the neurons and their inputs
- Next, we add another 3 Dense layers with 100 neurons each. There are diminishing returns to adding new layers and this is something we need to test as we build and optimise the network
- Finally, we add a Dense layer with 10 neurons as there are 10 classes to predict and as they are all exclusive, we use the softmax activation function
To get a full understanding of the model’s structure we can use:
model.summary()
And this will give us an output of the full structure of the network:
Model: “sequential_2” _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= flatten_2 (Flatten) (None, 784) 0 _________________________________________________________________ dense_10 (Dense) (None, 300) 235500 _________________________________________________________________ dense_11 (Dense) (None, 100) 30100 _________________________________________________________________ dense_12 (Dense) (None, 100) 10100 _________________________________________________________________ dense_13 (Dense) (None, 100) 10100 _________________________________________________________________ dense_14 (Dense) (None, 10) 1010 ================================================================= Total params: 286,810 Trainable params: 286,810 Non-trainable params: 0 _________________________________________________________________
As can be seen this network has a total of 286,810 trainable parameters (consisting of weights between neurons and bias terms) and this gives the network a lot of flexibility, but it also means that it will be very easy for it to overfit so we need to becareful.
Before we can train the network we need to compile it, and this is done with the following code:
model.compile(loss = “sparse_categorical_crossentropy”,
optimizer = “sgd”,
metrics = [“accuracy”])
In this line we are specifying 3 things:
i) The loss function to use. In this case we are using sparse categorical cross entropy — this is because we have an index of exclusive (sparse) labels we are trying to predict against
ii) The optimizer we are going to use to optimise the model against the loss function is stochastic gradient descent and this will ensure the model converges on an optimum solution i.e. Keras will use the backpropagation method described above
iii) Finally, we specify a metric that we are going to use in addition to loss to give us an idea of how well our model is working. In this case, we are using accuracy which gives an idea of how well our model is doing by giving a percentage of how many predictions match the actual class for the model we are training
Training the network
Training the network is easy once it has been compiled. All you need to do is call the model’s fit method (like Sklearn) as follows:
history = model.fit(X_train,
y_train,
epochs = 10,
validation_data = (X_valid, y_valid))
We initially pass in the data that we want to train the network on, in this case, X_train are the images and y_train is an array containing the labels. We also specify the number of epochs we want to train the model with (an epoch being defined as how many times we want to pass the training data through the network for training purposes).
Keras also lets us specify an optional validation_data argument where we pass in a validation data set. If we do this, then at the end of each epoch Keras will test the performance of the network on the validation data set. This is a good way of ensuring the model isn’t overfitting, however, it doesn’t feed into the training itself.
As training proceeds, you will see something like this:
Epoch 1/10 1719/1719 [==============================] — 5s 3ms/step — loss: 0.7698 — accuracy: 0.7385 — val_loss: 0.5738 — val_accuracy: 0.7962
Epoch 2/10 1719/1719 [==============================] — 5s 3ms/step — loss: 0.4830 — accuracy: 0.8283 — val_loss: 0.4570 — val_accuracy: 0.8404
Epoch 3/10 1719/1719 [==============================] — 5s 3ms/step — loss: 0.4261 — accuracy: 0.8480 — val_loss: 0.4121 — val_accuracy: 0.8522
Epoch 4/10 1719/1719 [==============================] — 5s 3ms/step — loss: 0.3932 — accuracy: 0.8582 — val_loss: 0.3951 — val_accuracy: 0.8566
Epoch 5/10 1719/1719 [==============================] — 5s 3ms/step — loss: 0.3708 — accuracy: 0.8660 — val_loss: 0.3597 — val_accuracy: 0.8682
Epoch 6/10 1719/1719 [==============================] — 5s 3ms/step — loss: 0.3518 — accuracy: 0.8728 — val_loss: 0.3397 — val_accuracy: 0.8756
Epoch 7/10 1719/1719 [==============================] — 5s 3ms/step — loss: 0.3369 — accuracy: 0.8779 — val_loss: 0.3506 — val_accuracy: 0.8738
Epoch 8/10 1719/1719 [==============================] — 5s 3ms/step — loss: 0.3243 — accuracy: 0.8814 — val_loss: 0.3343 — val_accuracy: 0.8774
Epoch 9/10 1719/1719 [==============================] — 4s 3ms/step — loss: 0.3128 — accuracy: 0.8861 — val_loss: 0.3415 — val_accuracy: 0.8794
Epoch 10/10 1719/1719 [==============================] — 4s 2ms/step — loss: 0.3019 — accuracy: 0.8888 — val_loss: 0.3265 — val_accuracy: 0.8822
This will continue for as long as the training is happening, there are accuracy and loss metrics for both the training and validation data sets. The value of the accuracy is a simple percentage measure of how many items the network got right. The value of loss is the cross entropy loss.
Once the model is trained, it is possible to call its history method to get a dictionary of the loss and any other metrics needed at every stage of the training. We can put these in a Pandas DataFrame and plot them as follows:
pd.DataFrame(history.history).plot(figsize = (16, 10))
plt.grid(True)
plt.gca().set_ylim(0, 1)
plt.show()
As can be seen above as the loss decreases, the accuracy increases. Two other things stand out from this plot:
- We could probably train this model longer as it doesn’t look like the loss has reached a minimum
- The accuracy of the training data set is higher than it is for the validation set (which is normal) but not wildly different to the validation dataset. This means there is no overfitting
Evaluating the performance of the neural network
Evaluating the performance of the network is straight forward and follows data science best practice principles. We call the model’s evalute method on the test data set to see how it performs. Remember that the test data set hasn’t been used in training and the network hasn’t seen this data before. We should perform this step only once so we can get an accurate idea of the model’s performance.
model.evalute(X_test, y_test)
This will run the model on the test data set and the output should look something like this:
313/313 [==============================] — 0s 2ms/step — loss: 0.3802 — accuracy: 0.8858
You’ll get an output of the loss and whatever other metrics specified when the model was compiled. Here we can see that this model is correct 88% of the time which isn’t bad for a simple network on such a difficult data set.
Next steps
In the next part of this series, I will talk about how to implement the above using a convolutional neural network and show how and why these perform better for image classification tasks.
You can get the code I’ve used for this work from my Github here. Thanks for reading.