Deep Learning: Leverage Transfer Learning

Author: Ramesh

For my current project, I was researching how to leverage transfer learning techniques to improve model accuracy. In this blog, I share my experience of using transfer learning in deep learning models.
With recent developments in deep learning, we can build complex neural network models trained on very large datasets. However, the main challenge is the limited time and resources available to train them. Even a simple image classification model with 1000 images can take hours of training time and GPU resources. In addition, getting good accuracy is a challenge when training data is limited. Transfer learning helps with both the resource and the time constraints.
Simply put, transfer learning takes a deep learning model that was trained on millions of examples for one problem and reuses it for a similar problem without full retraining. In general, deep learning models are highly repurposable. Since the original model was already trained on a large dataset, reusing it takes less training time and can result in lower generalization error. In addition, the weights of the convolution filters are well estimated thanks to the large original training data.
Popular Pre-Trained Models Available for Image Classification
There are many pre-trained models available for image classification that can be used for prediction, feature extraction, and fine-tuning. Model weights are downloaded automatically when instantiating a Keras model. Here are some of the image classification models whose weights were trained on ImageNet; the full list is available on the Keras Applications page:
• Xception
• VGG16
• VGG19
• ResNet50
There are many ways to fine-tune the original models and leverage transfer learning. Depending on the nature of the problem, we can use the model as it is, train only part of the model, or add or remove layers. I take image classification as the example for this article; let us look at each option in detail.
Pre-Trained Model as Feature Extractor
An image classification model classifies an image based on the features it detects. Features detected in the bottom layers are general features of the image, and toward the top they become narrower and more specific to the image class. This means the features from the bottom layers can be reused for similar images in a new classification problem. Thus, we can use some or all of the layers of a pre-trained model directly as the feature extraction component of a new model.
In the VGG16 model, by removing just the top layers, which classify an image into one of 1000 classes, we can use the rest of the neural network as a fixed feature extractor for the new dataset. To use the pre-trained VGG16 model for feature extraction, instantiate the VGG16 model with the “include_top” parameter set to “False”, and pass the image shape of the new dataset as an additional parameter.
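from keras.applications import VGG16

num_class = 4
image_size = 300
# include_top=False drops the 1000-class classifier; pooling='avg' returns one 512-value vector per image
vgg = VGG16(include_top=False, pooling='avg', weights='imagenet', input_shape=(image_size, image_size, 3))
vgg.summary()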
Model: "vgg16"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_3 (InputLayer)         (None, 300, 300, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 300, 300, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 300, 300, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 150, 150, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 150, 150, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 150, 150, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 75, 75, 128)       0         
_________________________________________________________________
block3_conv1 (Conv2D)        (None, 75, 75, 256)       295168    
_________________________________________________________________
block3_conv2 (Conv2D)        (None, 75, 75, 256)       590080    
_________________________________________________________________
block3_conv3 (Conv2D)        (None, 75, 75, 256)       590080    
_________________________________________________________________
block3_pool (MaxPooling2D)   (None, 37, 37, 256)       0         
_________________________________________________________________
block4_conv1 (Conv2D)        (None, 37, 37, 512)       1180160   
_________________________________________________________________
block4_conv2 (Conv2D)        (None, 37, 37, 512)       2359808   
_________________________________________________________________
block4_conv3 (Conv2D)        (None, 37, 37, 512)       2359808   
_________________________________________________________________
block4_pool (MaxPooling2D)   (None, 18, 18, 512)       0         
_________________________________________________________________
block5_conv1 (Conv2D)        (None, 18, 18, 512)       2359808   
_________________________________________________________________
block5_conv2 (Conv2D)        (None, 18, 18, 512)       2359808   
_________________________________________________________________
block5_conv3 (Conv2D)        (None, 18, 18, 512)       2359808   
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, 9, 9, 512)         0         
_________________________________________________________________
global_average_pooling2d_2 ( (None, 512)               0         
=================================================================
Total params: 14,714,688
Trainable params: 14,714,688
Non-trainable params: 0
_________________________________________________________________
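As a rough sketch of how the fixed extractor might then be used, the snippet below runs a batch of images through the network and trains a small classifier on the resulting 512-value vectors. The random images and labels, the smaller 64x64 input size, and the names `extractor` and `classifier` are assumptions made for illustration. `weights=None` is used only so the sketch runs without a download; in practice you would keep `weights='imagenet'` as above.

```python
import numpy as np
from keras import Input, layers, models
from keras.applications import VGG16
from keras.applications.vgg16 import preprocess_input

num_class = 4
image_size = 64  # assumption: smaller than the article's 300 so the sketch runs quickly

# weights=None keeps this sketch offline; use weights='imagenet' for real feature extraction.
extractor = VGG16(include_top=False, pooling='avg', weights=None,
                  input_shape=(image_size, image_size, 3))

# Placeholder data standing in for a real labelled image set.
rng = np.random.default_rng(0)
images = rng.uniform(0, 255, size=(8, image_size, image_size, 3)).astype('float32')
labels = rng.integers(0, num_class, size=8)

# One forward pass through the fixed network gives one feature vector per image.
features = extractor.predict(preprocess_input(images.copy()), verbose=0)

# A small classifier trained only on the extracted features.
classifier = models.Sequential([
    Input(shape=(features.shape[1],)),
    layers.Dense(num_class, activation='softmax'),
])
classifier.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
classifier.fit(features, labels, epochs=1, verbose=0)
probabilities = classifier.predict(features, verbose=0)
```

Because the extractor is never trained, the features can be computed once and cached, which is where most of the time saving comes from.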
Custom model with some layers of a pre-trained model
For problems that predict new classes not in the original VGG16 model, with only a small dataset available for training, we can use some layers of the pre-trained model together with new layers specific to the problem we are trying to solve.
To build this model, replace the last few layers with a new network of fully connected layers. We keep the weights of the initial layers frozen and retrain only the higher layers. How many layers to freeze and how many to train is something to try and test. In my experience with other projects, this approach has given me good accuracy even with small datasets.
Alternatively, the weights in the reused layers may be used as the starting point for the training process and adapted in response to the new problem. This usage treats transfer learning as a type of weight initialization scheme.
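The freeze-and-retrain idea can be sketched as a loop over the layers of the base model. The split point `block5_conv1` is a hypothetical choice for illustration, not a recommendation, and `weights=None` with a 64x64 input is used only so the sketch runs quickly without a download.

```python
from keras.applications import VGG16

# weights=None avoids the download in this sketch; use weights='imagenet' in practice.
base = VGG16(include_top=False, weights=None, input_shape=(64, 64, 3))

# Hypothetical split point: freeze everything before block5 and retrain block5.
first_trainable = 'block5_conv1'

trainable = False
for layer in base.layers:
    if layer.name == first_trainable:
        trainable = True
    layer.trainable = trainable

frozen = [layer.name for layer in base.layers if not layer.trainable]
retrained = [layer.name for layer in base.layers if layer.trainable]
print('frozen:', frozen)
print('retrained:', retrained)
```

Moving `first_trainable` to an earlier block retrains more of the network. The model should be compiled after the `trainable` flags are changed so that the change takes effect during training.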
Here is how we can build the new model using only a few layers from the original model, saving computational time and improving accuracy.
First, load the pre-trained VGG16 model without the final prediction layers, passing the image shape as below.
num_class = 4
image_size = 300
vgg = VGG16(include_top=False, pooling='avg', weights='imagenet', input_shape=(image_size, image_size, 3))
Here, I create a new model (my_model) from the input layer through the block1_pool layer of the VGG16 model. You can see that the total number of trainable parameters drops to about 38k, from 14 million in the original model.
from keras.models import Model

# Cut the network after the first pooling layer
layer_name = 'block1_pool'
my_model = Model(inputs=vgg.input, outputs=vgg.get_layer(layer_name).output)
my_model.summary()
Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_3 (InputLayer)         (None, 300, 300, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 300, 300, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 300, 300, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 150, 150, 64)      0         
=================================================================
Total params: 38,720
Trainable params: 38,720
Non-trainable params: 0
_________________________________________________________________
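The parameter counts in the two summaries can be checked by hand: a 3x3 convolution has 3 × 3 × (input channels) × (filters) weights plus one bias per filter, and pooling layers have no parameters. The helper name `conv2d_params` below is my own, used only for this check.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return kernel_size * kernel_size * in_channels * filters + filters

# (input channels, filters) for the 13 convolution layers of VGG16, in order.
vgg16_convs = [
    (3, 64), (64, 64),                       # block1
    (64, 128), (128, 128),                   # block2
    (128, 256), (256, 256), (256, 256),      # block3
    (256, 512), (512, 512), (512, 512),      # block4
    (512, 512), (512, 512), (512, 512),      # block5
]

per_layer = [conv2d_params(3, c_in, c_out) for c_in, c_out in vgg16_convs]
block1_total = sum(per_layer[:2])   # parameters kept in my_model
full_total = sum(per_layer)         # parameters in the full convolutional base
print(block1_total, full_total)     # 38720 14714688
```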
Using the same technique, this time we add further convolutional blocks and two fully connected layers, building our problem-specific layers on top of the pre-trained ones. The top layer of our custom model uses the softmax activation function to predict the new image class. Finally, we set the trainable flag to False to freeze the layers from the first block before training.
from keras import models
from keras.layers import Dense, Conv2D, MaxPooling2D, BatchNormalization, GlobalAveragePooling2D

model = models.Sequential()
model.add(my_model)  # pre-trained block1 of VGG16
# New problem-specific convolutional blocks
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2), padding='same'))
model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2), padding='same'))
model.add(GlobalAveragePooling2D())
# New fully connected layers ending in a softmax over the new classes
model.add(Dense(64, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(num_class, activation='softmax'))
model.layers[0].trainable = False  # freeze the pre-trained block
model.summary()
 
The final architecture of our custom model, as printed by “model.summary()”, is shown below. The complete model has only about 386k trainable parameters, compared to 14 million trainable parameters in the original VGG16.
Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
model_1 (Model)              (None, 150, 150, 64)      38720     
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 150, 150, 128)     73856     
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 75, 75, 128)       0         
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 75, 75, 256)       295168    
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 38, 38, 256)       0         
_________________________________________________________________
global_average_pooling2d_4 ( (None, 256)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 64)                16448     
_________________________________________________________________
batch_normalization_2 (Batch (None, 64)                256       
_________________________________________________________________
dense_6 (Dense)              (None, 4)                 260       
=================================================================
Total params: 424,708
Trainable params: 385,860
Non-trainable params: 38,848
_________________________________________________________________
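The trainable and non-trainable totals in this summary can be reproduced with a little arithmetic. The sketch below assumes the standard Keras accounting, where BatchNormalization has two trainable values per feature (gamma and beta) and two non-trainable ones (moving mean and variance). The helper names are my own. It also shows why the second pooling layer outputs 38 rather than 37: with padding='same' the size is rounded up instead of down.

```python
import math

def conv2d_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return kernel_size * kernel_size * in_channels * filters + filters

def dense_params(units_in, units_out):
    """Weights plus one bias per output unit for a Dense layer."""
    return units_in * units_out + units_out

frozen_base = 38720                    # block1 of VGG16, frozen
conv_a = conv2d_params(3, 64, 128)     # conv2d_4
conv_b = conv2d_params(3, 128, 256)    # conv2d_5
dense_a = dense_params(256, 64)        # dense_5
bn_trainable = 2 * 64                  # gamma and beta
bn_non_trainable = 2 * 64              # moving mean and moving variance
dense_b = dense_params(64, 4)          # dense_6

trainable = conv_a + conv_b + dense_a + bn_trainable + dense_b
non_trainable = frozen_base + bn_non_trainable
total = trainable + non_trainable
print(total, trainable, non_trainable)  # 424708 385860 38848

# 2x2 pooling of a 75-pixel side: 'same' padding rounds up, 'valid' rounds down.
same_pool = math.ceil(75 / 2)
valid_pool = 75 // 2
print(same_pool, valid_pool)  # 38 37
```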
Using the architecture of the pre-trained model
This option applies where we have a large image dataset and the classes are completely different. In this scenario, we can use the architecture of the VGG19 model and train it on the new data. Even though we need substantial computational power to train the model, it saves the research time of designing a model and is a good starting point for the new problem. In my current project, I am using one of the Faster R-CNN pre-trained models based on ImageNet. It helped me a lot as a starting point and was easy to customize after the initial training phase. In the NLP space, I have used pre-trained models as they are for embedding-vector-based language models. For a COVID-19-related project, I built a question-answering model on BioSentVec, a model trained on a medical dataset.
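Here is a sample code snippet that loads VGG16 as a frozen base and runs an image through it:
import numpy as np
from keras import applications

# Load VGG16 without its classifier and freeze the weights
conv_base = applications.VGG16(weights='imagenet', include_top=False, input_shape=(150, 150, 3))
conv_base.trainable = False
# Run a dummy image through the frozen base; the output is a feature map, not a class label
ret1 = conv_base.predict(np.ones([1, 150, 150, 3]))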
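To reuse only the architecture and train every weight on the new data, one possible sketch is to build the model with `weights=None` and a new number of output classes. The 64x64 input size, the optimizer, and the name `scratch_model` are assumptions for illustration. VGG16 is used here to match the rest of the article, and `keras.applications.VGG19` takes the same arguments.

```python
from keras.applications import VGG16

num_class = 4  # number of classes in the new dataset

# weights=None gives randomly initialised weights: only the architecture is reused.
# include_top=True keeps the fully connected head, sized here for num_class outputs.
scratch_model = VGG16(weights=None, include_top=True,
                      input_shape=(64, 64, 3), classes=num_class)

scratch_model.compile(optimizer='adam',
                      loss='categorical_crossentropy',
                      metrics=['accuracy'])

# Every layer stays trainable, so training updates the whole network.
all_trainable = all(layer.trainable for layer in scratch_model.layers)
print(scratch_model.output_shape, all_trainable)
```

Training then proceeds with `scratch_model.fit(...)` on the new images and one-hot labels. Since nothing is frozen, this is the most expensive of the options described here.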
Conclusion
There are many ways to leverage pre-trained models. People have tried various architectures on different types of datasets, and I strongly encourage you to go through the architectures of pre-trained models and apply them to your own problem statements. The best approach depends on the problem you are trying to solve. I recommend starting by evaluating the pre-trained model on a small sample dataset to see how it performs. Try to leverage as many layers of the pre-trained model as possible to start with, then add custom layers to improve accuracy. Especially when you have limited training data, combining pre-trained layers with custom layers is a wise approach.
Thanks!

Reference:

• Book: Deep Learning (Adaptive Computation and Machine Learning series) by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
