An Improved Object Detection Model based on Optimised CNN
Keywords:Object Detection, Transfer Learning, Computer Vision, Optimisation, CNN
Object detection is a computer vision technique that gives the ability to individually locate, recognise, and interpret multiple objects in an image with a better understanding. Modern image understanding tasks like image classification have been improved by state-of-the-art deep learning methods, particularly by convolutional neural networks (CNN). Region-based object detection algorithms such as Fast-RCNN achieve classification by CNN but over a longer period of time. You only look once (YOLO) prompts the object location and classification, treating object detection as a regression problem in an end-to-end network in a single step, whereas its accuracy decreases when the image has similar objects in a confined area, particularly when independent of the surrounding context. The aim of the current study is to improve YOLOv3 by optimising Darknet-53 to address the memory issue, using switchable normalisation techniques. We investigated the performance of five pre-trained networks, SqueezeNet, GoogleNet, ShuffleNet, Darknet-53, and Inception-V3, using a confusion matrix employing various epochs, learning rates, and mini-batches based on transfer learning. Darknet-53 took five times longer to complete the training and also ran into errors, most likely due to GPU memory shortages, whereas GoogleNet virtually obtained the same results in a fraction of the time. Using switchable normalisation techniques with the 10 class CIFAR-10 dataset, and utilising deep network designer (DND) of MATLAB R2021a, optimised versions of Darknet-53 increased the validation accuracy, considerably reducing the training time, and rectified the memory issue, which were then used as a backbone for YOLOv3 for effective object detection. The enhanced YOLOv3 was then assessed using a vehicle dataset and a sample Kuala Lumpur traffic scene using average precision. YOLOv3 with optimised CNN dNet-CIN as the backbone produced the best experimental results, with an FPS of 3.21 and a mAP-50 of 97%.