AlexNet was designed by Alex Krizhevsky together with Ilya Sutskever and Geoffrey Hinton, and it won the 2012 ImageNet competition. It was the first architecture after LeNet to bring a revolution to the deep learning industry. It achieved a top-5 error of 15.3% in the ImageNet Challenge, more than 10.8 percentage points lower than that of the runner-up.
AlexNet Architecture:
AlexNet consists of eight layers with learnable weights: five convolutional layers, two fully connected layers and one final softmax layer (itself a fully connected layer with 1000 outputs). These are supplemented by three overlapping max-pooling layers and two normalization layers.
AlexNet uses large kernels: the first convolutional layer applies 96 kernels of size 11×11 to extract the important low-level features from the image. The first two convolutional layers are each followed by an overlapping max-pooling layer. The number of kernels increases up to the third convolutional layer, i.e. 96, 256 and 384; this large number of kernels is used to extract many features from the image. The fourth and fifth convolutional layers, with 384 and 256 kernels respectively, are connected directly, with no pooling in between. The fifth convolutional layer is followed by an overlapping max-pooling layer, whose output goes to the first fully connected layer. The output of the second fully connected layer goes to a softmax classifier with 1000 class labels.
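As a small worked example (assuming the 224×224×3 input and 'valid' padding used in the Keras code further below), the spatial size of a feature map after a convolution or pooling layer is floor((input − kernel) / stride) + 1:
# Output spatial size for 'valid' padding: floor((W - K) / S) + 1
def out_size(w, k, s):
    return (w - k) // s + 1

w = 224
w = out_size(w, 11, 4)   # conv1: 11x11 kernel, stride 4 -> 54
w = out_size(w, 2, 2)    # max-pool: 2x2 window, stride 2 -> 27
print(w)                 # 27x27x96 is what the second convolutional layer receives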
AlexNet was among the first architectures to combine the ReLU activation, GPU training, image augmentation and Local Response Normalization.
Features of AlexNet
1. ReLU
ReLU-based convolutional networks train about six times faster than equivalent networks that use sigmoid or tanh activations.
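As a minimal illustration (a NumPy sketch, not taken from the AlexNet code), ReLU simply clips negative activations to zero, so its gradient is 1 for every positive input and does not saturate the way sigmoid or tanh do, which is the main reason training converges faster:
import numpy as np

def relu(x):
    # ReLU: element-wise max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for positive inputs and 0 otherwise -- it never saturates
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]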
2. Overlapping Pooling Layer
Overlapping max-pooling layers are the same as ordinary max-pooling layers, except that the adjacent windows over which the max is computed overlap each other (the window is larger than the stride). With non-overlapping pooling regions, spatial information is lost quickly and the network "sees" only the dominant pixel values (the winning unit of each max-pooling window); with overlapping regions, less of the surrounding spatial information is lost. This matters for overfitting, which is not tied to any particular dataset size: even on a dataset as large as ImageNet, a CNN can overfit by learning features that are only good for classifying a certain number of training examples instead of rich, generalizable features.
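A small sketch of the difference, using a 54×54×96 feature map (the shape the first convolution in the code further below produces): a 3×3 window with stride 2 overlaps, as in the AlexNet paper, while a 2×2 window with stride 2 does not. Note that the Keras implementation below uses the non-overlapping 2×2 variant.
from keras.layers import Input, MaxPooling2D
from keras.models import Model

inp = Input(shape=(54, 54, 96))
# Overlapping pooling: window (3x3) larger than the stride (2), as in the paper
overlapping = MaxPooling2D(pool_size=(3, 3), strides=(2, 2))(inp)
# Non-overlapping pooling: window (2x2) equal to the stride (2)
non_overlapping = MaxPooling2D(pool_size=(2, 2), strides=(2, 2))(inp)
print(Model(inp, overlapping).output_shape)      # (None, 26, 26, 96)
print(Model(inp, non_overlapping).output_shape)  # (None, 27, 27, 96)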
3. Local Response Normalization
Unlike sigmoid or tanh, the ReLU activation has no fixed output range, so a normalization step is usually applied after ReLU; AlexNet uses Local Response Normalization for this.
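Keras itself has no built-in LRN layer (which is presumably why the implementation below uses BatchNormalization in its place), but TensorFlow exposes the operation directly. A minimal sketch with the hyperparameters reported in the AlexNet paper (k = 2, n = 5, alpha = 1e-4, beta = 0.75); the 1×54×54×96 tensor is just a hypothetical conv1-sized feature map:
import tensorflow as tf

# Each activation is divided by a term that sums the squared activations of
# neighbouring channels at the same spatial position.
x = tf.random.normal((1, 54, 54, 96))
y = tf.nn.local_response_normalization(
    x, depth_radius=2, bias=2.0, alpha=1e-4, beta=0.75)  # depth_radius=2 -> window of n=5 channels
print(y.shape)  # (1, 54, 54, 96)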
4. Dropout Layer
Since AlexNet has 60 million parameters, it has a major issue with overfitting. To overcome this issue, dropout was used. Dropout is a regularization method that randomly drops neurons from the neural network with a fixed probability during training. When a neuron is dropped, it does not contribute to forward propagation or backward propagation.
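A minimal NumPy sketch of the mechanism (illustrative only; the drop probability and the eight activation values are made up): each neuron is kept or dropped at random, and a dropped neuron contributes nothing downstream. The paper applies dropout with p = 0.5 to the first two fully connected layers; the Keras Dropout layer used in the implementation below (with rate 0.4) does the same thing and also handles the rescaling of the surviving activations.
import numpy as np

rng = np.random.RandomState(0)
activations = rng.rand(8)          # hypothetical outputs of 8 neurons
drop_prob = 0.5
mask = rng.rand(8) >= drop_prob    # True = neuron kept, False = neuron dropped
dropped = activations * mask       # dropped neurons output 0 and contribute nothing
print(mask.astype(int))
print(dropped)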
5. Data Augmentation
Data augmentation is a technique to artificially create new training data from existing data. It increases the size of the dataset and introduces variability into it without actually collecting new data; the neural network treats the augmented images as distinct images anyway. Data augmentation can be used to add diversity to the training data and to address class imbalance in classification tasks, and it also helps reduce overfitting.
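A hedged sketch using Keras' ImageDataGenerator (not part of the training script below, which fits directly on the raw arrays): random rotations, shifts and horizontal flips generate new variants of each training image on the fly.
from keras.preprocessing.image import ImageDataGenerator

# Randomly perturb the training images every epoch instead of storing new files
datagen = ImageDataGenerator(
    rotation_range=15,        # rotate by up to 15 degrees
    width_shift_range=0.1,    # shift horizontally by up to 10% of the width
    height_shift_range=0.1,   # shift vertically by up to 10% of the height
    horizontal_flip=True)     # random left-right mirroring, as AlexNet does

# Usage, assuming x and y are the arrays loaded in the script below:
# model.fit_generator(datagen.flow(x, y, batch_size=64), epochs=30)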
Implementation with Keras
import keras
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, Flatten,\
Conv2D, MaxPooling2D
from keras.layers import BatchNormalization
import numpy as np
np.random.seed(1000)
# Get Data
import tflearn.datasets.oxflower17 as oxflower17
x, y = oxflower17.load_data(one_hot=True)
# Create a sequential model
model = Sequential()
# 1st Convolutional Layer
model.add(Conv2D(filters=96, input_shape=(224,224,3), kernel_size=(11,11),\
strides=(4,4), padding='valid'))
model.add(Activation('relu'))
# Pooling
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid'))
# Batch Normalisation before passing it to the next layer
model.add(BatchNormalization())
# 2nd Convolutional Layer
model.add(Conv2D(filters=256, kernel_size=(11,11), strides=(1,1), padding='valid'))
model.add(Activation('relu'))
# Pooling
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid'))
# Batch Normalisation
model.add(BatchNormalization())
# 3rd Convolutional Layer
model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='valid'))
model.add(Activation('relu'))
# Batch Normalisation
model.add(BatchNormalization())
# 4th Convolutional Layer
model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='valid'))
model.add(Activation('relu'))
# Batch Normalisation
model.add(BatchNormalization())
# 5th Convolutional Layer
model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='valid'))
model.add(Activation('relu'))
# Pooling
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid'))
# Batch Normalisation
model.add(BatchNormalization())
# Passing it to a dense layer
model.add(Flatten())
# 1st Dense Layer
model.add(Dense(4096))
model.add(Activation('relu'))
# Add Dropout to prevent overfitting
model.add(Dropout(0.4))
# Batch Normalisation
model.add(BatchNormalization())
# 2nd Dense Layer
model.add(Dense(4096))
model.add(Activation('relu'))
# Add Dropout
model.add(Dropout(0.4))
# Batch Normalization
model.add(BatchNormalization())
# Output Layer
model.add(Dense(17))
model.add(Activation('softmax'))
model.summary()
# Compile
model.compile(loss='categorical_crossentropy', optimizer='adam',\
metrics=['accuracy'])
# Train
model.fit(x, y, batch_size=64, epochs=30, verbose=1, \
validation_split=0.2, shuffle=True)
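After training, the fitted model can be evaluated or used for prediction in the usual Keras way; a usage sketch reusing the x and y arrays loaded above (for an honest estimate, a separate held-out test set would be needed):
# Overall loss and accuracy on the given arrays
loss, acc = model.evaluate(x, y, batch_size=64, verbose=0)
print('accuracy: %.3f' % acc)
# Class probabilities and predicted label for the first image
probs = model.predict(x[:1])
print('predicted class:', probs.argmax(axis=-1)[0])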