Advanced QuantizeML tutorial
This tutorial provides a comprehensive understanding of quantization with the QuantizeML Python package. Refer to the QuantizeML user guide and the Global Akida workflow tutorial for additional resources.
The QuantizeML Python package provides a user-friendly collection of functions for obtaining a quantized model. The quantize function replaces Keras layers with quantized, integer-only layers from QuantizeML.
1. Defining a quantization scheme
The quantization scheme refers to all the parameters used for quantization: the quantization method, such as per-axis or per-tensor, and the bitwidths used for inputs, outputs and weights.
The first part of this section explains how to define a quantization scheme using QuantizationParams, which applies a homogeneous scheme to all layers; the second part explains how to fully customize the quantization scheme using a configuration file.
1.1. The quantization parameters
The easiest way to customize quantization is to pass a QuantizationParams object to the qparams parameter of the quantize function.
from quantizeml.models import QuantizationParams
qparams = QuantizationParams(input_weight_bits=8, weight_bits=8, activation_bits=8,
per_tensor_activations=False, output_bits=8, buffer_bits=32)
By default, the quantization scheme adopted is 8-bit with per-axis activations, but every parameter can be set to a different value. The parameters are described below, with tips on how to set them (a short example follows the list):
- input_weight_bits is the bitwidth used to quantize the weights of the first layer. It is usually set to 8, which better preserves the overall accuracy.
- weight_bits is the bitwidth used to quantize all other weights. It is usually set to 8 (Akida 2.0) or 4 (Akida 1.0).
- activation_bits is the bitwidth used to quantize all ReLU activations. It is usually set to 8 (Akida 2.0) or 4 (Akida 1.0) but can be lower for edge learning (1-bit).
- per_tensor_activations is a boolean that selects per-axis (default) or per-tensor quantization for ReLU activations. Per-axis quantization (the default False value) usually provides more accurate results, but it might be more challenging to calibrate the model. Note that Akida 1.0 only supports per-tensor activations.
- output_bits is the bitwidth used to quantize intermediate results in OutputQuantizer. Go back to the user guide quantization flow for details about this process.
- buffer_bits is the maximum bitwidth allowed for low-level integer operations (e.g. matrix multiplications). It is set to 32 and should not be changed as this is what the Akida hardware target will use.
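For instance, a scheme targeting 4-bit quantization for Akida 1.0 could be built from the parameters above. This is only an illustrative sketch, not a recommendation for any particular model:
from quantizeml.models import QuantizationParams

# Illustrative 4-bit scheme for Akida 1.0: first layer weights stay at 8-bit,
# remaining weights and activations drop to 4-bit, activations are per-tensor
# (Akida 1.0 only supports per-tensor activations).
qparams_4bit = QuantizationParams(input_weight_bits=8,
                                  weight_bits=4,
                                  activation_bits=4,
                                  per_tensor_activations=True,
                                  output_bits=8,
                                  buffer_bits=32)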
Note
It is recommended to quantize a model to 8-bit or 4-bit to ensure it is Akida hardware compatible.
Warning
QuantizationParams is only applied the first time a model is quantized. If you want to re-quantize a model, you must provide a complete q_config.
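A minimal sketch of that re-quantization workflow, assuming a model was already quantized in the current session (dump_config and quantize are covered in the next section):
from quantizeml.models import dump_config, quantize

# Dump the complete configuration of the already-quantized model...
full_config = dump_config(quantized_model)
# ...optionally edit it, then pass it back explicitly when re-quantizing the
# float model, since qparams alone would be ignored on a second quantization.
requantized_model = quantize(model, q_config=full_config)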
1.2. Using a configuration file
Quantization can be further customized via a JSON configuration passed to the q_config parameter of the quantize function. This usage should be limited to targeted customization, as writing a whole configuration from scratch is error-prone. An example of targeted customization is setting the quantization bitwidth of a feature extractor output to 1-bit, which enables edge learning.
Warning
When provided, the configuration file has priority over arguments. As a consequence, the configuration file must contain all parameters: you cannot rely on argument defaults to set non-specified values.
The following code snippets show what a configuration file looks like and how to edit it to customize quantization.
import keras
import json
from quantizeml.models import quantize, dump_config, QuantizationParams
# Define an example model with a few layers to keep what follows readable
input = keras.layers.Input((28, 28, 3))
x = keras.layers.DepthwiseConv2D(kernel_size=3, name="dw_conv")(input)
x = keras.layers.Conv2D(filters=32, kernel_size=1, name="pw_conv")(x)
x = keras.layers.ReLU(name="relu")(x)
x = keras.layers.Dense(units=10, name="dense")(x)
model = keras.Model(input, x)
# Define QuantizationParams with specific values just for the sake of understanding the JSON
# configuration that follows.
qparams = QuantizationParams(input_weight_bits=16, weight_bits=4, activation_bits=6, output_bits=12,
per_tensor_activations=True, buffer_bits=24)
# Quantize the model
quantized_model = quantize(model, qparams=qparams)
quantized_model.summary()
1024/1024 [==============================] - 1s 833us/step
Model: "model_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 28, 28, 3)] 0
dw_conv (QuantizedDepthwiseConv2D)   (None, 26, 26, 3)    32
pw_conv (QuantizedConv2D)            (None, 26, 26, 32)   128
relu (QuantizedReLU)                 (None, 26, 26, 32)   2
dense (QuantizedDense)               (None, 26, 26, 10)   330
dequantizer_1 (Dequantizer)          (None, 26, 26, 10)   0
=================================================================
Total params: 492 (1.92 KB)
Trainable params: 488 (1.91 KB)
Non-trainable params: 4 (16.00 Byte)
_________________________________________________________________
# Dump the configuration
config = dump_config(quantized_model)
# Display in a JSON format for readability
print(json.dumps(config, indent=4))
{
"dw_conv": {
"output_quantizer": {
"bitwidth": 12,
"axis": "per-tensor",
"buffer_bitwidth": 24
},
"weight_quantizer": {
"bitwidth": 16,
"axis": -2
},
"buffer_bitwidth": 24
},
"pw_conv": {
"weight_quantizer": {
"bitwidth": 4
},
"buffer_bitwidth": 24
},
"relu": {
"output_quantizer": {
"bitwidth": 6,
"signed": false,
"axis": "per-tensor",
"buffer_bitwidth": 24
},
"buffer_bitwidth": 24
},
"dense": {
"weight_quantizer": {
"bitwidth": 4
},
"buffer_bitwidth": 24
}
}
Explaining the above configuration:
- The layer names index the configuration dictionary.
- The depthwise layer has an OutputQuantizer set to 12-bit (output_bits=12) to reduce the bitwidth of intermediate potentials before the pointwise layer that follows (it is automatically added when calling quantize).
- The depthwise layer weights are quantized to 16-bit because it is the first layer (input_weight_bits=16) and are quantized per-axis (the default for weights). The given axis is -2 because the Keras depthwise kernel shape is (Kx, Ky, F, 1), so the channel dimension is at index -2.
- The pointwise layer has weights quantized to 4-bit (weight_bits=4) but the quantization axis is not specified, as it defaults to -1 for per-axis quantization. One would need to set it to None for per-tensor quantization.
- The ReLU activation is quantized to 6-bit per-tensor (activation_bits=6, per_tensor_activations=True).
- All buffer_bitwidth values are set to 24 (buffer_bits=24).
The configuration will now be edited and used to quantize the float model through the q_config parameter.
# Edit the ReLU activation configuration
config["relu"]["output_quantizer"]['bitwidth'] = 1
config["relu"]["output_quantizer"]['axis'] = 'per-axis'
config["relu"]["output_quantizer"]['buffer_bitwidth'] = 32
config["relu"]['buffer_bitwidth'] = 32
# Drop other layers configurations
del config['dw_conv']
del config['pw_conv']
del config['dense']
# The configuration is now limited to the ReLU activation
print(json.dumps(config, indent=4))
{
"relu": {
"output_quantizer": {
"bitwidth": 1,
"signed": false,
"axis": "per-axis",
"buffer_bitwidth": 32
},
"buffer_bitwidth": 32
}
}
Now quantize the model setting both the qparams and q_config parameters: the activation will be quantized using the given configuration, while the other layers will use what is provided in qparams.
new_quantized_model = quantize(model, q_config=config, qparams=qparams)
# Dump the new configuration
new_config = dump_config(new_quantized_model)
# Display in a JSON format for readability
print(json.dumps(new_config, indent=4))
1024/1024 [==============================] - 1s 837us/step
{
"dw_conv": {
"output_quantizer": {
"bitwidth": 12,
"axis": "per-tensor",
"buffer_bitwidth": 24
},
"weight_quantizer": {
"bitwidth": 16,
"axis": -2
},
"buffer_bitwidth": 24
},
"pw_conv": {
"weight_quantizer": {
"bitwidth": 4
},
"buffer_bitwidth": 24
},
"relu": {
"output_quantizer": {
"bitwidth": 1,
"signed": false,
"axis": "per-axis",
"buffer_bitwidth": 32
},
"buffer_bitwidth": 32
},
"dense": {
"weight_quantizer": {
"bitwidth": 4
},
"buffer_bitwidth": 24
}
}
The new configuration contains both the manually set configuration for the activation and, for the other layers, the configuration derived from the quantization parameters.
2. Calibration
2.1. Why is calibration required?
OutputQuantizers are added between layer blocks during quantization in order to decrease the intermediate potential bitwidth and prevent saturation. Calibration is the process of defining the best possible quantization range for each OutputQuantizer.
Calibration statistically determines the quantization range by passing samples through the float model and observing the intermediate output values. The quantization range is stored in the range_max variable. The calibration algorithm used in QuantizeML is based on a moving maximum: range_max is initialized with the maximum value of the first batch of samples (per-axis or per-tensor depending on the quantization scheme) and the following batches update range_max with a moving momentum strategy (momentum is set to 0.9). Refer to the following pseudo code:
samples_max = reduce_max(samples)
delta = (previous_range_max - samples_max) * (1 - momentum)
new_range_max = previous_range_max - delta
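As an illustrative sketch only (not the actual QuantizeML implementation), the same moving-maximum update could be written in NumPy as follows:
import numpy as np

def update_range_max(previous_range_max, samples, momentum=0.9, axis=None):
    # axis=None gives a per-tensor maximum, a channel axis gives a per-axis one.
    samples_max = np.max(np.abs(samples), axis=axis)
    if previous_range_max is None:
        # First batch: range_max is initialized with the batch maximum.
        return samples_max
    # Following batches: blend the previous value with the new batch maximum.
    return momentum * previous_range_max + (1 - momentum) * samples_max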
In QuantizeML, as in other frameworks, calibration happens simultaneously with quantization, and the quantize function thus comes with calibration parameters: samples, num_samples, batch_size and epochs. The sections below describe how to set these parameters.
Note
Calibration does not require any label or sample annotation and is therefore different from training.
2.2. The samples
There are two types of calibration samples: randomly generated samples and real samples.
When the samples parameter of quantize is left to its default None value, random samples are generated using the num_samples value (default is 1024). When the model input shape has 1 or 3 channels, which corresponds to an image, the random sample values are unsigned 8-bit integers in the [0, 255] range. If the channel dimension is not 1 or 3, the generated samples are 8-bit signed integers in the [-128, 127] range. If that does not correspond to the range expected by your model, either add a Rescaling layer to your model using the insert_rescaling helper or provide real samples.
Real samples are often (but not necessarily) taken from the training dataset and should be the preferred option for calibration, as they will always lead to better results.
Samples are batched before being passed to the model for calibration. It is recommended to use at
least 1024 samples for calibration. When providing samples, num_samples
is only used to
compute the number of steps during calibration.
if batch_size is None:
    steps = num_samples
else:
    steps = np.ceil(num_samples / batch_size)
2.3. Other calibration parameters
batch_size
Setting a large enough batch_size is important as it will impact the range_max initialization, which is made on the first batch of samples. The recommended value is 100.
epochs
It is the number of iterations over the calibration samples. Increasing the value allows for more updates of the range_max variables thanks to the momentum policy, without requiring a huge amount of samples. The recommended value is 2.
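Putting the recommendations from this section together, a hypothetical calibration-aware quantization call could look like the following. Here x_train is a placeholder for your own training samples; labels are not needed for calibration:
import numpy as np
from quantizeml.models import quantize, QuantizationParams

# `x_train` is a placeholder for your own (unlabelled) training samples.
calibration_samples = x_train[:1024].astype(np.float32)

qparams = QuantizationParams(weight_bits=8, activation_bits=8)
quantized_model = quantize(model, qparams=qparams,
                           samples=calibration_samples,
                           num_samples=1024,  # at least 1024 samples recommended
                           batch_size=100,    # large enough for range_max initialization
                           epochs=2)          # more range_max updates via the momentum policy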
Total running time of the script: (0 minutes 5.716 seconds)