What is a Convolutional Neural Network?
CNNs are specialized neural networks designed for processing structured, grid-like data, particularly images. Unlike traditional fully connected networks, CNNs excel at extracting spatial hierarchies of features from input data.
Key Characteristics of CNNs:
- Parameter Sharing: Reduces the number of trainable parameters.
- Spatial Invariance: Captures patterns like edges, shapes, and textures across different parts of an image.
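Parameter sharing can be made concrete with a quick count. The sizes below (a 3×3 filter, 32 filters, a 28×28×1 input) are hypothetical, chosen only to illustrate the difference against a dense layer over the same input:

```python
# A conv filter's weights are reused at every spatial position, so the
# parameter count depends only on the filter size and number of filters.
conv_params = (3 * 3 * 1 + 1) * 32      # 3x3x1 weights + 1 bias, per filter, x32 filters
dense_params = (28 * 28 * 1 + 1) * 32   # every input pixel wired to each of 32 units

print(conv_params)   # 320
print(dense_params)  # 25120
```

The convolutional layer needs roughly 80× fewer trainable parameters for the same input, which is the practical payoff of parameter sharing.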
Architecture of CNNs
The architecture of a CNN typically comprises the following components:
- Input Layer
- Convolutional Layer
- Pooling Layer
- Fully Connected Layer
- Output Layer
1. Input Layer
The input layer accepts the raw image data. An image can be represented as a 3D tensor of shape H × W × C, where:
- H: Height of the image.
- W: Width of the image.
- C: Number of color channels (3 for RGB images).
For example, an RGB image of H × W pixels has dimensions H × W × 3.
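As a minimal sketch, a hypothetical 4×5 RGB image represented as a NumPy array has exactly this H × W × C layout:

```python
import numpy as np

# A hypothetical RGB image: height 4, width 5, 3 color channels.
image = np.zeros((4, 5, 3), dtype=np.uint8)

H, W, C = image.shape
print(H, W, C)  # 4 5 3
```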
2. Convolutional Layer
The convolutional layer applies convolution operations to extract features from the input image using filters.
A filter (also called a kernel) is a small matrix that extracts specific features from the input data, such as edges, textures, or patterns, by sliding across the input image and performing element-wise multiplications.
Features represent patterns or characteristics that a model extracts from input data to make predictions. These features vary in complexity, depending on the depth of the network:
- Low-Level Features: Early layers in the network detect basic elements like edges, corners, or simple textures. These are fundamental building blocks in images.
- Mid-Level Features: Middle layers identify more complex patterns such as shapes or repeated textures.
- High-Level Features: Deeper layers capture abstract concepts like specific objects or scenes.
Mathematical Formulation
The convolution operation involves sliding a filter over the input image and computing the dot product at each position. Mathematically:

y[i, j] = Σ_{m=0}^{k−1} Σ_{n=0}^{k−1} x[i+m, j+n] · w[m, n] + b

Where:
- x: Input pixel value.
- w: Filter weights.
- b: Bias term.
- k: Filter size.
Example
Consider the 3×3 horizontal-gradient filter:

[[-1, 0, 1],
 [-1, 0, 1],
 [-1, 0, 1]]

Sliding this filter over an image detects edges in the horizontal direction (regions where intensity changes from left to right).
Padding and Stride
- Padding: Adds a border of zeros around the input to preserve spatial dimensions.
  - Without padding, the output size is reduced with each convolution.
- Stride: Determines how far the filter moves at each step.
- Formula for output size:

  O = (I − K + 2P) / S + 1

  Where:
  - I: Input size.
  - K: Kernel size.
  - P: Padding.
  - S: Stride.
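To make the convolution operation and the output-size formula concrete, here is a minimal NumPy sketch. The kernel and input are illustrative (the same horizontal-gradient filter as above, applied to an image whose intensity increases left to right):

```python
import numpy as np

def conv2d(x, w, b=0.0, stride=1, padding=0):
    """Valid 2D convolution (really cross-correlation, as in most CNN libraries)."""
    if padding:
        x = np.pad(x, padding)
    k = w.shape[0]
    # Output size follows O = (I - K + 2P) / S + 1 (padding already applied above).
    out_h = (x.shape[0] - k) // stride + 1
    out_w = (x.shape[1] - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = x[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(region * w) + b  # element-wise multiply, then sum
    return out

w = np.array([[-1, 0, 1],
              [-1, 0, 1],
              [-1, 0, 1]])            # horizontal-gradient kernel
x = np.tile(np.arange(5.0), (5, 1))   # each row is [0, 1, 2, 3, 4]

y = conv2d(x, w)
print(y.shape)  # (3, 3): (5 - 3 + 0)/1 + 1 = 3
print(y)        # every entry is 6.0: constant left-to-right gradient
```

With padding=1 the output stays 5×5, matching the formula with P = 1.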
3. Activation Functions
After convolution, activation functions introduce non-linearity to the model. The most common activation function is ReLU (Rectified Linear Unit):

ReLU(x) = max(0, x)
ReLU is computationally efficient and mitigates the vanishing gradient problem.
Other Activation Functions:
- Sigmoid: σ(x) = 1 / (1 + e^(−x))
- Tanh: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
4. Pooling Layer
The pooling layer downsamples feature maps: it reduces the spatial dimensions of the input while preserving important features. This makes the network more efficient by decreasing the number of parameters and computations.
Functions of the Pooling Layer:
- Dimensionality Reduction: Reduces the size of feature maps, lowering computational cost and memory usage.
- Feature Highlighting: Retains dominant features (e.g., edges, textures) while discarding less significant information.
Types of Pooling:
- Max Pooling:
  - Selects the maximum value from each region covered by the pooling window.
  - Example (illustrative values):
    - Input:

      [[ 1,  2,  5,  6],
       [ 3,  4,  7,  8],
       [ 9, 10, 13, 14],
       [11, 12, 15, 16]]

    - With a 2×2 window and stride 2:
    - Max values: 4, 8, 12, 16
    - Output:

      [[ 4,  8],
       [12, 16]]

- Average Pooling:
  - Computes the average value of the region covered by the pooling window.
  - Example:
    - Input (same as above).
    - Averages: 2.5, 6.5, 10.5, 14.5
    - Output:

      [[ 2.5,  6.5],
       [10.5, 14.5]]
Max Pooling vs Average Pooling
- Max Pooling is preferred for tasks where detecting the strongest activation is important (e.g., object detection).
- Average Pooling is often used for smoothing or when precise averaging is required, such as in global average pooling for classification tasks.
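A minimal NumPy sketch of both pooling modes on a 4×4 input (the values are illustrative):

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Max or average pooling over windows of the given size."""
    out_size = (x.shape[0] - size) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

x = np.array([[ 1,  2,  5,  6],
              [ 3,  4,  7,  8],
              [ 9, 10, 13, 14],
              [11, 12, 15, 16]], dtype=float)

print(pool2d(x, mode="max"))  # [[ 4.  8.] [12. 16.]]
print(pool2d(x, mode="avg"))  # [[ 2.5  6.5] [10.5 14.5]]
```

Either mode halves each spatial dimension (4×4 → 2×2) with a 2×2 window and stride 2.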
5. Fully Connected Layer
Fully connected layers flatten the feature maps into a vector and pass it through dense layers. These layers perform the final classification using a softmax activation function:

softmax(z_i) = e^(z_i) / Σ_j e^(z_j)
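The flatten-then-softmax pipeline can be sketched with NumPy. The feature-map shape (7×7×64) and random weights are hypothetical stand-ins for a trained network:

```python
import numpy as np

def softmax(z):
    z = z - z.max()   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(7, 7, 64))   # hypothetical pooled feature maps
flat = feature_maps.reshape(-1)              # flatten to a 3136-element vector

W = rng.normal(size=(10, flat.size)) * 0.01  # dense weights for 10 classes
b = np.zeros(10)
probs = softmax(W @ flat + b)

print(probs.shape)            # (10,)
print(probs.sum().round(6))   # 1.0 -- a valid probability distribution
```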
Backpropagation in CNNs
CNNs are trained using backpropagation, which minimizes a loss function like cross-entropy. The key steps are:
- Forward Pass: Compute the output of the network.
- Loss Calculation: Measure the error using a loss function.
- Backward Pass: Update weights using gradients computed via the chain rule:

  w ← w − η · ∂L/∂w

  Where:
  - L: Loss.
  - η: Learning rate.
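The weight-update rule can be demonstrated on a toy loss where the gradient is known in closed form. The loss L(w) = (w − 3)² is an illustrative stand-in for a real network's loss:

```python
# Gradient descent on L(w) = (w - 3)^2, whose gradient is dL/dw = 2 * (w - 3).
w = 0.0     # initial weight
eta = 0.1   # learning rate

for _ in range(100):
    grad = 2 * (w - 3)   # backward pass: compute the gradient
    w = w - eta * grad   # update rule: w <- w - eta * dL/dw

print(round(w, 4))  # 3.0 -- converges to the loss minimum
```

In a CNN the same update is applied to every filter weight and bias, with gradients propagated layer by layer via the chain rule.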
An example CNN architecture for an image classification task:
1. Input Layer
- Accepts input images of size 28×28×1 (grayscale).
2. Convolutional Layer
- Apply 32 small filters (e.g., 3×3) to detect low-level features like edges and textures.
- Use ReLU activation to introduce non-linearity.
3. Pooling Layer
- Perform max pooling with a 2×2 window to reduce the spatial dimensions by half (28×28 → 14×14) while retaining the dominant features.
4. Additional Convolutional and Pooling Layers
- Add another convolutional layer with 64 filters (e.g., 3×3), followed by a max pooling layer. This extracts more complex features and further reduces dimensions (14×14 → 7×7).
5. Flattening Layer
- Convert the feature maps into a 1D vector with 3,136 features. This step prepares the data for dense layers.
6. Fully Connected Layer
- A dense layer with 128 neurons connects all the extracted features and learns high-level patterns.
7. Output Layer
- Use a dense layer with 10 neurons (one for each digit) and a softmax activation function to output probabilities for each class.
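The shape bookkeeping in the steps above can be verified with a short trace. This assumes 'same' convolutions (spatial dimensions preserved) and 2×2 max pooling, as described:

```python
# Trace tensor shapes through the architecture described above.
def trace(h, w, c):
    shapes = [("input", (h, w, c))]
    c = 32
    shapes.append(("conv, 32 filters", (h, w, c)))   # 'same' conv keeps H, W
    h //= 2; w //= 2
    shapes.append(("max pool 2x2", (h, w, c)))       # halves H and W
    c = 64
    shapes.append(("conv, 64 filters", (h, w, c)))
    h //= 2; w //= 2
    shapes.append(("max pool 2x2", (h, w, c)))
    shapes.append(("flatten", (h * w * c,)))         # 7 * 7 * 64 = 3136
    shapes.append(("dense", (128,)))
    shapes.append(("output", (10,)))
    return shapes

for name, shape in trace(28, 28, 1):
    print(f"{name:20s} {shape}")
```

Running the trace confirms the 3,136-feature flatten step: 28×28 → 14×14 → 7×7 spatially, with 64 channels at the end.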