YOLO/PASCAL-VOC detection tutorial
This tutorial demonstrates that Akida can perform object detection. This is illustrated using a subset of the PASCAL-VOC 2007 dataset, which contains 20 classes. The YOLOv2 architecture from Redmon et al. (2016) has been chosen to tackle this object detection problem.
1. Introduction
1.1 Object detection
Object detection is a computer vision task that combines two elementary tasks:
object classification, which consists in assigning a class label to an image, as shown in the AkidaNet/ImageNet inference example
object localization, which consists in drawing a bounding box around one or several objects in an image
One can learn more about the subject by reading this introduction to object detection blog article.
1.2 YOLO key concepts
You Only Look Once (YOLO) is a deep neural network architecture dedicated to object detection.
As opposed to classic networks that handle object detection, YOLO predicts bounding boxes (localization task) and class probabilities (classification task) with a single neural network in a single evaluation. The object detection task is framed as a single regression problem, from image pixels to spatially separated bounding boxes and associated class probabilities.
The base concept of YOLO is to divide the input image into regions, forming a grid, and to predict bounding boxes and probabilities for each region. The bounding boxes are weighted by the predicted probabilities.
YOLO also uses the concept of “anchor boxes” or “prior boxes”. The network does not directly predict bounding boxes but rather offsets from anchor boxes, which are templates (width/height ratios) computed by clustering the dimensions of the ground truth boxes in the training dataset. The anchors thus represent the average shapes and sizes of the objects to detect. More details on the anchor boxes concept are given in this blog article.
Additional information about YOLO can be found on the Darknet website, and the source code for the preprocessing and postprocessing functions included in the akida_models package (see the processing section in the model zoo) is largely inspired by the experiencor GitHub repository.
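As an illustration of how such anchors can be obtained, the short sketch below runs a k-means-style clustering on ground truth box dimensions using 1 - IoU as the distance, which is the usual YOLOv2 recipe. It is a simplified, standalone example on random data, not the actual generate_anchors implementation used later in this tutorial.
import numpy as np

def iou_wh(wh, centroids):
    # IoU between one (w, h) pair and each centroid, assuming boxes share the same center
    inter = np.minimum(wh[0], centroids[:, 0]) * np.minimum(wh[1], centroids[:, 1])
    union = wh[0] * wh[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(box_wh, k, n_iter=50, seed=0):
    # box_wh: (N, 2) array of ground truth widths and heights in grid-cell units
    rng = np.random.default_rng(seed)
    centroids = box_wh[rng.choice(len(box_wh), k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign each box to the closest centroid (distance = 1 - IoU)
        assign = np.array([np.argmax(iou_wh(wh, centroids)) for wh in box_wh])
        # Move each centroid to the mean of its assigned boxes
        for i in range(k):
            if np.any(assign == i):
                centroids[i] = box_wh[assign == i].mean(axis=0)
    return centroids

# Toy example: 200 random box sizes expressed in 7x7 grid-cell units
toy_boxes = np.random.default_rng(42).uniform(0.5, 6.5, size=(200, 2))
print(kmeans_anchors(toy_boxes, k=5))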
2. Preprocessing tools
A subset of VOC has been prepared with test images from VOC2007 containing 5 examples of each class. The dataset is represented as a TFRecord file containing images, labels and bounding boxes.
The load_tf_dataset helper function loads and parses the TFRecord file.
The YOLO toolkit offers several methods to prepare data for processing, see load_image and preprocess_image.
import tensorflow as tf
from akida_models import fetch_file
# Download the TFRecord test set from the Brainchip data server
data_path = fetch_file(
fname="voc_test_20_classes.tfrecord",
origin="https://data.brainchip.com/dataset-mirror/voc/test_20_classes.tfrecord",
cache_subdir='datasets/voc',
extract=True)
# Helper function to load and parse the TFRecord file.
def load_tf_dataset(tf_record_file_path):
    tfrecord_files = [tf_record_file_path]

    # Feature description for parsing the TFRecord
    feature_description = {
        'image': tf.io.FixedLenFeature([], tf.string),
        'objects/bbox': tf.io.VarLenFeature(tf.float32),
        'objects/label': tf.io.VarLenFeature(tf.int64),
    }

    def _count_tfrecord_examples(dataset):
        return len(list(dataset.as_numpy_iterator()))

    def _parse_tfrecord_fn(example_proto):
        example = tf.io.parse_single_example(example_proto, feature_description)

        # Decode the image from bytes
        example['image'] = tf.io.decode_jpeg(example['image'], channels=3)

        # Convert the VarLenFeatures to dense tensors
        example['objects/label'] = tf.sparse.to_dense(example['objects/label'], default_value=0)
        example['objects/bbox'] = tf.sparse.to_dense(example['objects/bbox'])

        # Boxes were flattened, so reshape them back to (num_boxes, 4)
        example['objects/bbox'] = tf.reshape(example['objects/bbox'],
                                             (tf.shape(example['objects/label'])[0], 4))

        # Create a new dictionary structure
        objects = {
            'label': example['objects/label'],
            'bbox': example['objects/bbox'],
        }

        # Remove unnecessary keys
        example.pop('objects/label')
        example.pop('objects/bbox')

        # Add 'objects' key to the main dictionary
        example['objects'] = objects

        return example

    # Create a TFRecordDataset
    dataset = tf.data.TFRecordDataset(tfrecord_files)
    len_dataset = _count_tfrecord_examples(dataset)
    parsed_dataset = dataset.map(_parse_tfrecord_fn)

    return parsed_dataset, len_dataset
labels = ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus',
'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse',
'motorbike', 'person', 'pottedplant', 'sheep', 'sofa',
'train', 'tvmonitor']
val_dataset, len_val_dataset = load_tf_dataset(data_path)
print(f"Loaded VOC2007 sample test data: {len_val_dataset} images.")
Downloading data from https://data.brainchip.com/dataset-mirror/voc/test_20_classes.tfrecord.
Download complete.
Loaded VOC2007 sample test data: 100 images.
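As a quick sanity check, one can pull a single parsed record from the dataset and inspect its contents; this short snippet only uses the val_dataset and labels objects defined above.
# Take one parsed example and inspect its structure
sample = next(iter(val_dataset))
print("Image shape:", sample['image'].shape)
print("Bounding boxes shape:", sample['objects']['bbox'].shape)
print("Labels:", [labels[int(i)] for i in sample['objects']['label']])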
Anchors can also be computed easily using the YOLO toolkit.
Note
The following code is given as an example. In a real use case scenario, anchors are computed on the training dataset.
from akida_models.detection.generate_anchors import generate_anchors
num_anchors = 5
grid_size = (7, 7)
anchors_example = generate_anchors(val_dataset, num_anchors, grid_size)
Average IOU for 5 anchors: 0.70
Anchors: [[1.09818, 2.02575], [2.22135, 2.81762], [2.91442, 3.24278], [4.70526, 5.3364], [5.26481, 5.67309]]
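Following the YOLOv2 convention, these anchors appear to be expressed in grid-cell units; with a 224x224 input and the 7x7 grid used here, one cell covers 32 pixels, so the anchors can be converted to approximate pixel sizes for a quick, purely illustrative check.
# Convert the example anchors from grid-cell units to pixels (224 / 7 = 32 px per cell)
cell_size = 224 / grid_size[0]
print([(round(w * cell_size), round(h * cell_size)) for w, h in anchors_example])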
3. Model architecture
The model zoo contains a YOLO model built upon the AkidaNet architecture, with 3 separable convolutional layers at the top for bounding box and class estimation, followed by a final separable convolutional layer which is the detection layer. Note that for efficiency, the alpha parameter of AkidaNet (the network width, i.e. the number of filters in each layer) is set to 0.5.
from akida_models import yolo_base
# Create a yolo model for 20 classes with 5 anchors and grid size of 7
classes = len(labels)
model = yolo_base(input_shape=(224, 224, 3),
classes=classes,
nb_box=num_anchors,
alpha=0.5)
model.summary()
Model: "yolo_base"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input (InputLayer) [(None, 224, 224, 3)] 0
rescaling (Rescaling) (None, 224, 224, 3) 0
conv_0 (Conv2D) (None, 112, 112, 16) 432
conv_0/BN (BatchNormalizat (None, 112, 112, 16) 64
ion)
conv_0/relu (ReLU) (None, 112, 112, 16) 0
conv_1 (Conv2D) (None, 112, 112, 32) 4608
conv_1/BN (BatchNormalizat (None, 112, 112, 32) 128
ion)
conv_1/relu (ReLU) (None, 112, 112, 32) 0
conv_2 (Conv2D) (None, 56, 56, 64) 18432
conv_2/BN (BatchNormalizat (None, 56, 56, 64) 256
ion)
conv_2/relu (ReLU) (None, 56, 56, 64) 0
conv_3 (Conv2D) (None, 56, 56, 64) 36864
conv_3/BN (BatchNormalizat (None, 56, 56, 64) 256
ion)
conv_3/relu (ReLU) (None, 56, 56, 64) 0
dw_separable_4 (DepthwiseC (None, 28, 28, 64) 576
onv2D)
pw_separable_4 (Conv2D) (None, 28, 28, 128) 8192
pw_separable_4/BN (BatchNo (None, 28, 28, 128) 512
rmalization)
pw_separable_4/relu (ReLU) (None, 28, 28, 128) 0
dw_separable_5 (DepthwiseC (None, 28, 28, 128) 1152
onv2D)
pw_separable_5 (Conv2D) (None, 28, 28, 128) 16384
pw_separable_5/BN (BatchNo (None, 28, 28, 128) 512
rmalization)
pw_separable_5/relu (ReLU) (None, 28, 28, 128) 0
dw_separable_6 (DepthwiseC (None, 14, 14, 128) 1152
onv2D)
pw_separable_6 (Conv2D) (None, 14, 14, 256) 32768
pw_separable_6/BN (BatchNo (None, 14, 14, 256) 1024
rmalization)
pw_separable_6/relu (ReLU) (None, 14, 14, 256) 0
dw_separable_7 (DepthwiseC (None, 14, 14, 256) 2304
onv2D)
pw_separable_7 (Conv2D) (None, 14, 14, 256) 65536
pw_separable_7/BN (BatchNo (None, 14, 14, 256) 1024
rmalization)
pw_separable_7/relu (ReLU) (None, 14, 14, 256) 0
dw_separable_8 (DepthwiseC (None, 14, 14, 256) 2304
onv2D)
pw_separable_8 (Conv2D) (None, 14, 14, 256) 65536
pw_separable_8/BN (BatchNo (None, 14, 14, 256) 1024
rmalization)
pw_separable_8/relu (ReLU) (None, 14, 14, 256) 0
dw_separable_9 (DepthwiseC (None, 14, 14, 256) 2304
onv2D)
pw_separable_9 (Conv2D) (None, 14, 14, 256) 65536
pw_separable_9/BN (BatchNo (None, 14, 14, 256) 1024
rmalization)
pw_separable_9/relu (ReLU) (None, 14, 14, 256) 0
dw_separable_10 (Depthwise (None, 14, 14, 256) 2304
Conv2D)
pw_separable_10 (Conv2D) (None, 14, 14, 256) 65536
pw_separable_10/BN (BatchN (None, 14, 14, 256) 1024
ormalization)
pw_separable_10/relu (ReLU (None, 14, 14, 256) 0
)
dw_separable_11 (Depthwise (None, 14, 14, 256) 2304
Conv2D)
pw_separable_11 (Conv2D) (None, 14, 14, 256) 65536
pw_separable_11/BN (BatchN (None, 14, 14, 256) 1024
ormalization)
pw_separable_11/relu (ReLU (None, 14, 14, 256) 0
)
dw_separable_12 (Depthwise (None, 7, 7, 256) 2304
Conv2D)
pw_separable_12 (Conv2D) (None, 7, 7, 512) 131072
pw_separable_12/BN (BatchN (None, 7, 7, 512) 2048
ormalization)
pw_separable_12/relu (ReLU (None, 7, 7, 512) 0
)
dw_separable_13 (Depthwise (None, 7, 7, 512) 4608
Conv2D)
pw_separable_13 (Conv2D) (None, 7, 7, 512) 262144
pw_separable_13/BN (BatchN (None, 7, 7, 512) 2048
ormalization)
pw_separable_13/relu (ReLU (None, 7, 7, 512) 0
)
dw_1conv (DepthwiseConv2D) (None, 7, 7, 512) 4608
pw_1conv (Conv2D) (None, 7, 7, 1024) 524288
pw_1conv/BN (BatchNormaliz (None, 7, 7, 1024) 4096
ation)
pw_1conv/relu (ReLU) (None, 7, 7, 1024) 0
dw_2conv (DepthwiseConv2D) (None, 7, 7, 1024) 9216
pw_2conv (Conv2D) (None, 7, 7, 1024) 1048576
pw_2conv/BN (BatchNormaliz (None, 7, 7, 1024) 4096
ation)
pw_2conv/relu (ReLU) (None, 7, 7, 1024) 0
dw_3conv (DepthwiseConv2D) (None, 7, 7, 1024) 9216
pw_3conv (Conv2D) (None, 7, 7, 1024) 1048576
pw_3conv/BN (BatchNormaliz (None, 7, 7, 1024) 4096
ation)
pw_3conv/relu (ReLU) (None, 7, 7, 1024) 0
dw_detection_layer (Depthw (None, 7, 7, 1024) 9216
iseConv2D)
pw_detection_layer (Conv2D (None, 7, 7, 125) 128125
)
=================================================================
Total params: 3665965 (13.98 MB)
Trainable params: 3653837 (13.94 MB)
Non-trainable params: 12128 (47.38 KB)
_________________________________________________________________
The model output can be reshaped to a more natural shape of:
(grid_height, grid_width, anchor_boxes, 4 + 1 + num_classes)
where the “4 + 1” term represents the coordinates of the estimated bounding boxes (top left x, top left y, width and height) and a confidence score. In other words, the output channels are grouped by anchor box, and within each group one channel provides either a coordinate, the global confidence score or a class confidence score. This decoding is done automatically by the decode_output function.
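With 5 anchors and 20 classes this gives 5 x (4 + 1 + 20) = 125 output channels, which matches the detection layer in the summary above. The decoding itself can be sketched with the usual YOLOv2 parameterization, where the raw outputs are interpreted as center offsets and log-scale sizes relative to the anchors. The sketch below is illustrative only: the actual decode_output function also converts boxes to corner coordinates, applies a score threshold and performs non-maximum suppression, and its exact conventions may differ.
import numpy as np

def decode_sketch(output, anchors, conf_threshold=0.3):
    # output: (grid_h, grid_w, num_anchors, 4 + 1 + num_classes)
    grid_h, grid_w, num_anchors, _ = output.shape
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    detections = []
    for row in range(grid_h):
        for col in range(grid_w):
            for b in range(num_anchors):
                tx, ty, tw, th, to = output[row, col, b, :5]
                class_logits = output[row, col, b, 5:]
                # Box center relative to the whole image, in [0, 1]
                x = (col + sigmoid(tx)) / grid_w
                y = (row + sigmoid(ty)) / grid_h
                # Box size scaled by the anchor dimensions (anchors in grid-cell units)
                w = anchors[b][0] * np.exp(tw) / grid_w
                h = anchors[b][1] * np.exp(th) / grid_h
                # Confidence is the objectness score times the best class probability
                class_probs = np.exp(class_logits) / np.sum(np.exp(class_logits))
                score = sigmoid(to) * class_probs.max()
                if score > conf_threshold:
                    detections.append((x, y, w, h, int(class_probs.argmax()), float(score)))
    return detections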
from tensorflow.keras import Model
from tensorflow.keras.layers import Reshape
# Define a reshape output to be added to the YOLO model
output = Reshape((grid_size[1], grid_size[0], num_anchors, 4 + 1 + classes),
name="YOLO_output")(model.output)
# Build the complete model
full_model = Model(model.input, output)
full_model.output
<KerasTensor: shape=(None, 7, 7, 5, 25) dtype=float32 (created by layer 'YOLO_output')>
4. Training
As the YOLO model relies on the Brainchip AkidaNet/ImageNet network, it is possible to perform transfer learning from ImageNet pretrained weights when training a YOLO model. See the PlantVillage transfer learning example for a detailed explanation of transfer learning principles. Additionally, to achieve optimal results, consider the following approach:
1. Initially, train the model on the COCO dataset. This process helps in learning general object detection features and improves the model’s ability to detect various objects across different contexts.
2. After training on COCO, transfer the learned weights to a model equipped with a VOC head (a minimal weight-transfer sketch is given after this list).
3. Fine-tune the transferred weights on the VOC dataset. This step allows the model to adapt to the specific characteristics and nuances of the VOC dataset, further enhancing its performance on VOC-related tasks.
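A minimal sketch of step 2 using plain Keras weight transfer is given below. The COCO weights file name is a placeholder and the fine-tuning setup (YOLO loss, optimizer, data pipeline) is omitted; this is not part of the akida_models API shown in this tutorial, just one possible way to transfer a backbone.
# Build a fresh YOLO model with a VOC head (20 classes), reusing yolo_base imported above
voc_model = yolo_base(input_shape=(224, 224, 3), classes=20, nb_box=5, alpha=0.5)

# Transfer all layers whose names and shapes match from a (hypothetical) COCO checkpoint.
# The detection head stays randomly initialized since its size depends on the class count.
voc_model.load_weights("yolo_akidanet_coco.h5", by_name=True, skip_mismatch=True)

# The model can then be fine-tuned on VOC with a YOLO loss (not shown here).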
5. Performance
The model zoo also contains a helper method that creates a YOLO model for VOC, loads pretrained weights for the detection task and returns the corresponding anchors. The anchors are used to interpret the model outputs.
The metric used to evaluate YOLO is the mean average precision (mAP), which is the percentage of correct predictions at a given intersection over union (IoU) threshold. Scores in this example are given for the standard IoU thresholds of 0.5 and 0.75, and for the mean across IoU thresholds ranging from 0.5 to 0.95: a detection is considered valid if its IoU with the corresponding ground truth box is above 0.5 for mAP 50, or above 0.75 for mAP 75.
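The IoU criterion behind these thresholds is straightforward to compute; the sketch below assumes boxes in (x1, y1, x2, y2) corner format and is independent of the MapEvaluation implementation used next.
def iou(box_a, box_b):
    # Boxes given as (x1, y1, x2, y2)
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# IoU of about 0.47: this detection would count at a 0.45 threshold, but not for mAP 50
print(iou((10, 10, 60, 60), (20, 20, 70, 70)))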
Note
A call to evaluate_map will preprocess the images, call Model.predict and use decode_output before computing the precision for all classes.
from timeit import default_timer as timer
from akida_models import yolo_voc_pretrained
from akida_models.detection.map_evaluation import MapEvaluation
# Load the pretrained model along with anchors
model_keras, anchors = yolo_voc_pretrained()
model_keras.summary()
Downloading data from https://data.brainchip.com/dataset-mirror/coco/coco_anchors.pkl.
Download complete.
Downloading data from https://data.brainchip.com/models/AkidaV2/yolo/yolo_akidanet_voc_i8_w4_a4.h5.
Download complete.
Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input (InputLayer) [(None, 224, 224, 3)] 0
rescaling (QuantizedRescal (None, 224, 224, 3) 0
ing)
conv_0 (QuantizedConv2D) (None, 112, 112, 16) 448
conv_0/relu (QuantizedReLU (None, 112, 112, 16) 32
)
conv_1 (QuantizedConv2D) (None, 112, 112, 32) 4640
conv_1/relu (QuantizedReLU (None, 112, 112, 32) 64
)
conv_2 (QuantizedConv2D) (None, 56, 56, 64) 18496
conv_2/relu (QuantizedReLU (None, 56, 56, 64) 128
)
conv_3 (QuantizedConv2D) (None, 56, 56, 64) 36928
conv_3/relu (QuantizedReLU (None, 56, 56, 64) 128
)
dw_separable_4 (QuantizedD (None, 28, 28, 64) 704
epthwiseConv2D)
pw_separable_4 (QuantizedC (None, 28, 28, 128) 8320
onv2D)
pw_separable_4/relu (Quant (None, 28, 28, 128) 256
izedReLU)
dw_separable_5 (QuantizedD (None, 28, 28, 128) 1408
epthwiseConv2D)
pw_separable_5 (QuantizedC (None, 28, 28, 128) 16512
onv2D)
pw_separable_5/relu (Quant (None, 28, 28, 128) 256
izedReLU)
dw_separable_6 (QuantizedD (None, 14, 14, 128) 1408
epthwiseConv2D)
pw_separable_6 (QuantizedC (None, 14, 14, 256) 33024
onv2D)
pw_separable_6/relu (Quant (None, 14, 14, 256) 512
izedReLU)
dw_separable_7 (QuantizedD (None, 14, 14, 256) 2816
epthwiseConv2D)
pw_separable_7 (QuantizedC (None, 14, 14, 256) 65792
onv2D)
pw_separable_7/relu (Quant (None, 14, 14, 256) 512
izedReLU)
dw_separable_8 (QuantizedD (None, 14, 14, 256) 2816
epthwiseConv2D)
pw_separable_8 (QuantizedC (None, 14, 14, 256) 65792
onv2D)
pw_separable_8/relu (Quant (None, 14, 14, 256) 512
izedReLU)
dw_separable_9 (QuantizedD (None, 14, 14, 256) 2816
epthwiseConv2D)
pw_separable_9 (QuantizedC (None, 14, 14, 256) 65792
onv2D)
pw_separable_9/relu (Quant (None, 14, 14, 256) 512
izedReLU)
dw_separable_10 (Quantized (None, 14, 14, 256) 2816
DepthwiseConv2D)
pw_separable_10 (Quantized (None, 14, 14, 256) 65792
Conv2D)
pw_separable_10/relu (Quan (None, 14, 14, 256) 512
tizedReLU)
dw_separable_11 (Quantized (None, 14, 14, 256) 2816
DepthwiseConv2D)
pw_separable_11 (Quantized (None, 14, 14, 256) 65792
Conv2D)
pw_separable_11/relu (Quan (None, 14, 14, 256) 512
tizedReLU)
dw_separable_12 (Quantized (None, 7, 7, 256) 2816
DepthwiseConv2D)
pw_separable_12 (Quantized (None, 7, 7, 512) 131584
Conv2D)
pw_separable_12/relu (Quan (None, 7, 7, 512) 1024
tizedReLU)
dw_separable_13 (Quantized (None, 7, 7, 512) 5632
DepthwiseConv2D)
pw_separable_13 (Quantized (None, 7, 7, 512) 262656
Conv2D)
pw_separable_13/relu (Quan (None, 7, 7, 512) 1024
tizedReLU)
dw_1conv (QuantizedDepthwi (None, 7, 7, 512) 5632
seConv2D)
pw_1conv (QuantizedConv2D) (None, 7, 7, 1024) 525312
pw_1conv/relu (QuantizedRe (None, 7, 7, 1024) 2048
LU)
dw_2conv (QuantizedDepthwi (None, 7, 7, 1024) 11264
seConv2D)
pw_2conv (QuantizedConv2D) (None, 7, 7, 1024) 1049600
pw_2conv/relu (QuantizedRe (None, 7, 7, 1024) 2048
LU)
dw_3conv (QuantizedDepthwi (None, 7, 7, 1024) 11264
seConv2D)
pw_3conv (QuantizedConv2D) (None, 7, 7, 1024) 1049600
pw_3conv/relu (QuantizedRe (None, 7, 7, 1024) 2048
LU)
dw_detection_layer (Quanti (None, 7, 7, 1024) 11264
zedDepthwiseConv2D)
voc_classifier (QuantizedC (None, 7, 7, 125) 128125
onv2D)
dequantizer (Dequantizer) (None, 7, 7, 125) 0
=================================================================
Total params: 3671805 (14.01 MB)
Trainable params: 3647773 (13.92 MB)
Non-trainable params: 24032 (93.88 KB)
_________________________________________________________________
# Define the final reshape and build the model
output = Reshape((grid_size[1], grid_size[0], num_anchors, 4 + 1 + classes),
name="YOLO_output")(model_keras.output)
model_keras = Model(model_keras.input, output)
# Create the mAP evaluator object
map_evaluator = MapEvaluation(model_keras, val_dataset,
len_val_dataset, labels, anchors)
# Compute the scores for all validation images
start = timer()
map_dict, average_precisions = map_evaluator.evaluate_map()
mAP = sum(map_dict.values()) / len(map_dict)
end = timer()
for label, average_precision in average_precisions.items():
    print(labels[label], '{:.4f}'.format(average_precision))
print('mAP 50: {:.4f}'.format(map_dict[0.5]))
print('mAP 75: {:.4f}'.format(map_dict[0.75]))
print('mAP: {:.4f}'.format(mAP))
print(f'Keras inference on {len_val_dataset} images took {end-start:.2f} s.\n')
aeroplane 0.7300
bicycle 0.4417
bird 0.4833
boat 0.3070
bottle 0.2627
bus 0.7147
car 0.6889
cat 0.7670
chair 0.2726
cow 0.3700
diningtable 0.4711
dog 0.5736
horse 0.6147
motorbike 0.5083
person 0.4021
pottedplant 0.1094
sheep 0.2976
sofa 0.6283
train 0.6042
tvmonitor 0.5643
mAP 50: 0.8409
mAP 75: 0.5086
mAP: 0.4906
Keras inference on 100 images took 20.08 s.
6. Conversion to Akida
6.1 Convert to Akida model
The last YOLO_output layer, which was added to split the output channels into per-box values, must be removed before Akida conversion.
# Rebuild a model without the last layer
compatible_model = Model(model_keras.input, model_keras.layers[-2].output)
When converting to an Akida model, we just need to pass the Keras model to cnn2snn.convert.
from cnn2snn import convert
model_akida = convert(compatible_model)
model_akida.summary()
Model Summary
________________________________________________
Input shape Output shape Sequences Layers
================================================
[224, 224, 3] [7, 7, 125] 1 33
________________________________________________
__________________________________________________________________________
Layer (type) Output shape Kernel shape
==================== SW/conv_0-dequantizer (Software) ====================
conv_0 (InputConv2D) [112, 112, 16] (3, 3, 3, 16)
__________________________________________________________________________
conv_1 (Conv2D) [112, 112, 32] (3, 3, 16, 32)
__________________________________________________________________________
conv_2 (Conv2D) [56, 56, 64] (3, 3, 32, 64)
__________________________________________________________________________
conv_3 (Conv2D) [56, 56, 64] (3, 3, 64, 64)
__________________________________________________________________________
dw_separable_4 (DepthwiseConv2D) [28, 28, 64] (3, 3, 64, 1)
__________________________________________________________________________
pw_separable_4 (Conv2D) [28, 28, 128] (1, 1, 64, 128)
__________________________________________________________________________
dw_separable_5 (DepthwiseConv2D) [28, 28, 128] (3, 3, 128, 1)
__________________________________________________________________________
pw_separable_5 (Conv2D) [28, 28, 128] (1, 1, 128, 128)
__________________________________________________________________________
dw_separable_6 (DepthwiseConv2D) [14, 14, 128] (3, 3, 128, 1)
__________________________________________________________________________
pw_separable_6 (Conv2D) [14, 14, 256] (1, 1, 128, 256)
__________________________________________________________________________
dw_separable_7 (DepthwiseConv2D) [14, 14, 256] (3, 3, 256, 1)
__________________________________________________________________________
pw_separable_7 (Conv2D) [14, 14, 256] (1, 1, 256, 256)
__________________________________________________________________________
dw_separable_8 (DepthwiseConv2D) [14, 14, 256] (3, 3, 256, 1)
__________________________________________________________________________
pw_separable_8 (Conv2D) [14, 14, 256] (1, 1, 256, 256)
__________________________________________________________________________
dw_separable_9 (DepthwiseConv2D) [14, 14, 256] (3, 3, 256, 1)
__________________________________________________________________________
pw_separable_9 (Conv2D) [14, 14, 256] (1, 1, 256, 256)
__________________________________________________________________________
dw_separable_10 (DepthwiseConv2D) [14, 14, 256] (3, 3, 256, 1)
__________________________________________________________________________
pw_separable_10 (Conv2D) [14, 14, 256] (1, 1, 256, 256)
__________________________________________________________________________
dw_separable_11 (DepthwiseConv2D) [14, 14, 256] (3, 3, 256, 1)
__________________________________________________________________________
pw_separable_11 (Conv2D) [14, 14, 256] (1, 1, 256, 256)
__________________________________________________________________________
dw_separable_12 (DepthwiseConv2D) [7, 7, 256] (3, 3, 256, 1)
__________________________________________________________________________
pw_separable_12 (Conv2D) [7, 7, 512] (1, 1, 256, 512)
__________________________________________________________________________
dw_separable_13 (DepthwiseConv2D) [7, 7, 512] (3, 3, 512, 1)
__________________________________________________________________________
pw_separable_13 (Conv2D) [7, 7, 512] (1, 1, 512, 512)
__________________________________________________________________________
dw_1conv (DepthwiseConv2D) [7, 7, 512] (3, 3, 512, 1)
__________________________________________________________________________
pw_1conv (Conv2D) [7, 7, 1024] (1, 1, 512, 1024)
__________________________________________________________________________
dw_2conv (DepthwiseConv2D) [7, 7, 1024] (3, 3, 1024, 1)
__________________________________________________________________________
pw_2conv (Conv2D) [7, 7, 1024] (1, 1, 1024, 1024)
__________________________________________________________________________
dw_3conv (DepthwiseConv2D) [7, 7, 1024] (3, 3, 1024, 1)
__________________________________________________________________________
pw_3conv (Conv2D) [7, 7, 1024] (1, 1, 1024, 1024)
__________________________________________________________________________
dw_detection_layer (DepthwiseConv2D) [7, 7, 1024] (3, 3, 1024, 1)
__________________________________________________________________________
voc_classifier (Conv2D) [7, 7, 125] (1, 1, 1024, 125)
__________________________________________________________________________
dequantizer (Dequantizer) [7, 7, 125] N/A
__________________________________________________________________________
6.2 Check performance
Akida model accuracy is tested on the first n images of the validation set.
# Create the mAP evaluator object
map_evaluator_ak = MapEvaluation(model_akida,
val_dataset,
len_val_dataset,
labels,
anchors,
is_keras_model=False)
# Compute the scores for all validation images
start = timer()
map_ak_dict, average_precisions_ak = map_evaluator_ak.evaluate_map()
mAP_ak = sum(map_ak_dict.values()) / len(map_ak_dict)
end = timer()
for label, average_precision in average_precisions_ak.items():
    print(labels[label], '{:.4f}'.format(average_precision))
print('mAP 50: {:.4f}'.format(map_ak_dict[0.5]))
print('mAP 75: {:.4f}'.format(map_ak_dict[0.75]))
print('mAP: {:.4f}'.format(mAP_ak))
print(f'Akida inference on {len_val_dataset} images took {end-start:.2f} s.\n')
aeroplane 0.7300
bicycle 0.4417
bird 0.4833
boat 0.3070
bottle 0.2627
bus 0.7147
car 0.6889
cat 0.7670
chair 0.2726
cow 0.3700
diningtable 0.4711
dog 0.5736
horse 0.6147
motorbike 0.5083
person 0.4021
pottedplant 0.1094
sheep 0.2976
sofa 0.6283
train 0.6042
tvmonitor 0.5643
mAP 50: 0.8409
mAP 75: 0.5086
mAP: 0.4906
Akida inference on 100 images took 12.82 s.
6.3 Show predictions for a random image
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from akida_models.detection.processing import preprocess_image, decode_output
# Shuffle the data to take a random test image
val_dataset = val_dataset.shuffle(buffer_size=len_val_dataset)
input_shape = model_akida.layers[0].input_dims
# Load the image
raw_image = next(iter(val_dataset))['image']
# Keep the original image size for later bounding boxes rescaling
raw_height, raw_width, _ = raw_image.shape
# Pre-process the image
image = preprocess_image(raw_image, input_shape)
input_image = image[np.newaxis, :].astype(np.uint8)
# Call evaluate on the image
pots = model_akida.predict(input_image)[0]
# Reshape the potentials to prepare for decoding
h, w, c = pots.shape
pots = pots.reshape((h, w, len(anchors), 4 + 1 + len(labels)))
# Decode potentials into bounding boxes
raw_boxes = decode_output(pots, anchors, len(labels))
# Rescale boxes to the original image size
pred_boxes = np.array([[
box.x1 * raw_width, box.y1 * raw_height, box.x2 * raw_width,
box.y2 * raw_height,
box.get_label(),
box.get_score()
] for box in raw_boxes])
fig = plt.figure(num='VOC detection by Akida')
ax = fig.subplots(1)
img_plot = ax.imshow(np.zeros(raw_image.shape, dtype=np.uint8))
img_plot.set_data(raw_image)
for box in pred_boxes:
    rect = patches.Rectangle((box[0], box[1]),
                             box[2] - box[0],
                             box[3] - box[1],
                             linewidth=1,
                             edgecolor='r',
                             facecolor='none')
    ax.add_patch(rect)
    class_score = ax.text(box[0],
                          box[1] - 5,
                          f"{labels[int(box[4])]} - {box[5]:.2f}",
                          color='red')
plt.axis('off')
plt.show()
Total running time of the script: (0 minutes 55.315 seconds)