Building a Convolutional Neural Network from Scratch
I thought it'd be fun to do this so here we are.
Process
Convolution Operation
Figuring out the convolution was quite simple: it's just an adaptation of the valid cross-correlation, which slides the kernel on top of the input, multiplies the overlapping values, and sums them up. To get the convolution, you rotate the kernel 180 degrees first. So the two operations look like:

$$(X \star K)_{i,j} = \sum_{m}\sum_{n} X_{i+m,\,j+n}\,K_{m,n} \qquad\qquad X * K = X \star \mathrm{rot180}(K)$$
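As a sanity check, here's a minimal NumPy sketch of both operations (the function names are my own):

```python
import numpy as np

def correlate2d_valid(input, kernel):
    # Slide the kernel over the input, multiply the overlapping
    # values, and sum them up (valid cross-correlation).
    ih, iw = input.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    output = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            output[i, j] = np.sum(input[i:i + kh, j:j + kw] * kernel)
    return output

def convolve2d_valid(input, kernel):
    # The convolution is the cross-correlation with the kernel rotated 180 degrees.
    return correlate2d_valid(input, np.rot90(kernel, 2))
```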
Implementing the Convolutional Layer
Forward Kernel
Now for the crux of the CNN: the convolutional layer. It takes a 3-dimensional block of data as input, with shape (depth, height, width). The kernels are 3-dimensional blocks as well, each spanning the full depth of the input. Something neat is that we can have multiple kernels, all of which extend through the input's depth. Each kernel also has an associated bias matrix with the same shape as its output. The layer then produces a 3-dimensional block of data as the output: computing one output matrix involves taking the cross-correlation of the input with a kernel and summing this up with the bias, and that process is repeated for each kernel. For the first kernel, with an input of depth $d$, the output is:

$$Y_1 = B_1 + X_1 \star K_{11} + X_2 \star K_{12} + \cdots + X_d \star K_{1d}$$

We can repeat this equation for every kernel, simply by using a different kernel and bias matrix; the inputs, $X_1, \dots, X_d$, stay the same. This is called the forward propagation of the convolutional layer.
Using sum notation, we can write it like this:

$$Y_i = B_i + \sum_{j=1}^{d} X_j \star K_{ij}, \qquad i = 1, \dots, n$$

where $n$ is the number of kernels and $\star$ denotes the valid cross-correlation.
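As a rough sketch, the layer and its forward pass could look like this in NumPy/SciPy (the constructor argument names are mine; `depth` here means the number of kernels):

```python
import numpy as np
from scipy import signal

class Convolutional:
    def __init__(self, input_shape, kernel_size, depth):
        # input_shape = (input_depth, height, width); depth = number of kernels
        input_depth, input_height, input_width = input_shape
        self.depth = depth
        self.input_shape = input_shape
        self.input_depth = input_depth
        self.output_shape = (depth,
                             input_height - kernel_size + 1,
                             input_width - kernel_size + 1)
        self.kernels_shape = (depth, input_depth, kernel_size, kernel_size)
        self.kernels = np.random.randn(*self.kernels_shape)
        self.biases = np.random.randn(*self.output_shape)

    def forward(self, input):
        # Y_i = B_i + sum_j X_j (valid cross-correlation) K_ij
        self.input = input
        self.output = np.copy(self.biases)
        for i in range(self.depth):
            for j in range(self.input_depth):
                self.output[i] += signal.correlate2d(
                    self.input[j], self.kernels[i, j], "valid")
        return self.output
```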
Backward Kernel
To update the kernels and biases, we need to compute their gradients. We're given the derivative of E with respect to the output, $\frac{\partial E}{\partial Y}$, and we need to compute two things. First, the derivative of E with respect to the trainable parameters of the layer, $\frac{\partial E}{\partial K}$ and $\frac{\partial E}{\partial B}$:

$$\frac{\partial E}{\partial K_{ij}} = X_j \star \frac{\partial E}{\partial Y_i}, \qquad \frac{\partial E}{\partial B_i} = \frac{\partial E}{\partial Y_i}$$

Second, the derivative of E with respect to the input of the layer, $\frac{\partial E}{\partial X}$, which is what gets passed back to the previous layer:

$$\frac{\partial E}{\partial X_j} = \sum_{i=1}^{n} \frac{\partial E}{\partial Y_i} *_{\text{full}} K_{ij}$$

where $\star$ is again the valid cross-correlation and $*_{\text{full}}$ is the full convolution.
Once we have these, the backward method starts by initializing empty arrays for the kernel and input gradients. Then, inside two nested for loops over the kernel index $i$ and the input-depth index $j$, we fill in the kernel gradient and accumulate the input gradient, simply translating the formulas above into code (with the help of SciPy's signal module).
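Here's a rough sketch of that backward method, written as a continuation of the `Convolutional` class from the forward-pass sketch above (the `learning_rate` argument and the in-place parameter update are assumptions about how the layer API hangs together):

```python
    def backward(self, output_gradient, learning_rate):
        # output_gradient is dE/dY, shaped like the layer's output
        kernels_gradient = np.zeros(self.kernels_shape)
        input_gradient = np.zeros(self.input_shape)

        for i in range(self.depth):            # loop over kernels
            for j in range(self.input_depth):  # loop over input depth
                # dE/dK_ij = X_j (valid cross-correlation) dE/dY_i
                kernels_gradient[i, j] = signal.correlate2d(
                    self.input[j], output_gradient[i], "valid")
                # dE/dX_j += dE/dY_i (full convolution) K_ij
                input_gradient[j] += signal.convolve2d(
                    output_gradient[i], self.kernels[i, j], "full")

        # Gradient descent step; dE/dB is just dE/dY
        self.kernels -= learning_rate * kernels_gradient
        self.biases -= learning_rate * output_gradient
        return input_gradient
```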
> [!NOTE]
> To be honest, this is where I got really lost, and I'll definitely be revisiting this later to better understand what's going on here. This is also the core element of the computer vision algorithms using deep learning today, so it's pretty important to understand this part.
Implementing the Reshape Layer
So this layer inherits from the base layer class. The class looks something like this:
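(A sketch written standalone, assuming the base layer class just defines `forward` and `backward`.)

```python
import numpy as np

class Reshape:
    # In the actual code this would inherit from the base Layer class.
    def __init__(self, input_shape, output_shape):
        self.input_shape = input_shape
        self.output_shape = output_shape

    def forward(self, input):
        # e.g. flatten a (5, 26, 26) block into a (5 * 26 * 26, 1) column vector
        return np.reshape(input, self.output_shape)

    def backward(self, output_gradient, learning_rate):
        # Undo the reshape so the gradient matches the previous layer's output shape.
        return np.reshape(output_gradient, self.input_shape)
```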
The constructor takes in the shape of the input and output. The forward method reshapes the input to the output shape. The backward method reshapes the output to the input shape. Not too much going on here.
Binary Cross-Entropy Loss
We're given a vector, $Y^*$, containing the desired outputs of the neural network. Keep in mind that each $y_i^* \in \{0, 1\}$ (hence the term binary).
We also have the actual output of the neural network, $Y$.
The binary cross-entropy loss is defined as the following:

$$E = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i^* \log(y_i) + (1 - y_i^*) \log(1 - y_i) \right]$$
The goal is to compute the derivative of E with respect to the output, $\frac{\partial E}{\partial Y}$. For a given $y_i$, only the $i$-th term of the sum depends on it.
Thus, we can just use the chain rule on that term and we get the following:

$$\frac{\partial E}{\partial y_i} = \frac{1}{n} \left( \frac{1 - y_i^*}{1 - y_i} - \frac{y_i^*}{y_i} \right)$$
Also, I added a small epsilon value that prevents log(0) and division by 0. After converting this to code, it looks something like this:
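(A NumPy sketch of the loss and its derivative, with the epsilon baked in.)

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred):
    eps = 1e-15  # keeps log() and the divisions below away from 0
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def binary_cross_entropy_prime(y_true, y_pred):
    eps = 1e-15
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # dE/dy_i = ((1 - y*_i) / (1 - y_i) - y*_i / y_i) / n
    return ((1 - y_true) / (1 - y_pred) - y_true / y_pred) / np.size(y_true)
```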
Sigmoid Activation
The sigmoid activation takes any real number and outputs a value between 0 and 1. This is particularly useful for binary classification problems, where the output is interpreted as a probability. The sigmoid activation is defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
The derivative is:

$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$
And the implementation looks like this:
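(A sketch as a standalone layer; the stored input and the `learning_rate` argument just keep the interface consistent with the other layers.)

```python
import numpy as np

class Sigmoid:
    def forward(self, input):
        self.input = input
        return 1 / (1 + np.exp(-input))

    def backward(self, output_gradient, learning_rate):
        # chain rule: dE/dX = dE/dY * sigma'(X), with sigma'(x) = sigma(x)(1 - sigma(x))
        s = 1 / (1 + np.exp(-self.input))
        return output_gradient * s * (1 - s)
```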
Solving MNIST
MNIST is a dataset of handwritten digits (0-9). The goal of this CNN is to classify these images; since we're using a binary loss, we'll stick to the zeros and ones. We load the MNIST dataset from the keras library like so:
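```python
from keras.datasets import mnist

# 60,000 training images and 10,000 test images of handwritten digits
(x_train, y_train), (x_test, y_test) = mnist.load_data()
```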
First, we get the indices of the images representing a zero or a one, stack those indices together, and shuffle them, then extract just those images. Next, we reshape each image from 28x28 pixels to a 3D block of 1x28x28 pixels, because our convolutional layer takes in a 3D block of data with the depth as the first dimension. The images contain values from 0 to 255, so we normalize the input by dividing each value by 255. For the output vector, we use another util from keras called to_categorical, which creates a one-hot encoded vector from a number; essentially, 0 becomes [1, 0] and 1 becomes [0, 1]. Finally, we reshape each label into a 2x1 column vector, because that's the type of input the dense layer takes.
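Put together, the preprocessing could look something like this (the `preprocess_data` name and the `limit` cap on samples per class are my own choices):

```python
import numpy as np
from keras.utils import to_categorical

def preprocess_data(x, y, limit):
    # Keep only the images labelled 0 or 1, then shuffle them together.
    zero_index = np.where(y == 0)[0][:limit]
    one_index = np.where(y == 1)[0][:limit]
    all_indices = np.random.permutation(np.hstack((zero_index, one_index)))
    x, y = x[all_indices], y[all_indices]

    # Reshape each 28x28 image into a 1x28x28 block and normalize to [0, 1].
    x = x.reshape(len(x), 1, 28, 28).astype("float32") / 255

    # One-hot encode the labels (0 -> [1, 0], 1 -> [0, 1]) and shape them
    # as 2x1 column vectors, which is what the dense layer expects.
    y = to_categorical(y)
    y = y.reshape(len(y), 2, 1)
    return x, y
```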
Finally, our network looks something like this:
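(A sketch of one possible stack, assuming the `Dense` layer from the earlier fully-connected network; the 5 kernels of size 3x3 and the 100-unit hidden layer are illustrative choices.)

```python
network = [
    Convolutional((1, 28, 28), 3, 5),        # 5 kernels of size 3x3 over a 1x28x28 input
    Sigmoid(),
    Reshape((5, 26, 26), (5 * 26 * 26, 1)),  # flatten to a column vector
    Dense(5 * 26 * 26, 100),
    Sigmoid(),
    Dense(100, 2),                           # two outputs: "zero" vs "one"
    Sigmoid(),
]
```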
We then define our epochs and learning rate. I used values 20 and 0.1 respectively.
Now for training: it looks quite similar to training a regular neural network, except that we use the binary cross-entropy loss here.
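A sketch of what that loop could look like, assuming `x_train` and `y_train` are the preprocessed arrays from above and the layers expose the `forward`/`backward` interface sketched earlier:

```python
epochs = 20
learning_rate = 0.1

for epoch in range(epochs):
    error = 0
    for x, y in zip(x_train, y_train):
        # forward pass through every layer
        output = x
        for layer in network:
            output = layer.forward(output)

        error += binary_cross_entropy(y, output)

        # backward pass: start from the loss gradient and push it through in reverse
        grad = binary_cross_entropy_prime(y, output)
        for layer in reversed(network):
            grad = layer.backward(grad, learning_rate)

    print(f"epoch {epoch + 1}/{epochs}, error = {error / len(x_train):.4f}")
```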
Running It
Acknowledgements
This was super fun to build and I learned a lot. Thanks to The Independent Code and his extremely informative video which I followed and adapted.