QuantizeML toolkit
Overview
The QuantizeML package provides base layers and quantization tools for deep-learning models. It allows the quantization of CNN and Vision Transformer models using low-bitwidth weights and outputs. Once a model has been quantized with the provided tools, the CNN2SNN toolkit can convert it for execution with the Akida runtime.
The FixedPoint representation
QuantizeML uses a FixedPoint representation in place of float values for layer inputs, outputs and weights.
FixedPoint numbers are actually integers with a static number of fractional bits, so that the represented float value is \(x_{int} \cdot 2^{-frac\_bits}\).
The precision of the representation is directly related to the number of fractional bits. For example, representing PI using an 8-bit FixedPoint with varying fractional bits:
frac_bits | x_int | float value
---|---|---
1 | 6 | 3.0
3 | 25 | 3.125
6 | 201 | 3.140625
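As a quick cross-check of the table (plain Python, not the QuantizeML FixedPoint class), each row can be reproduced by rounding PI to an integer with the chosen number of fractional bits:

import math

for frac_bits in (1, 3, 6):
    x_int = round(math.pi * 2**frac_bits)         # integer mantissa
    print(frac_bits, x_int, x_int / 2**frac_bits)  # recovered float value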
Further details are available in the FixedPoint API documentation.
Thanks to the FixedPoint representation, all operations within layers are implemented as integer-only operations [1].
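For example, multiplying two FixedPoint numbers reduces to an integer multiply, with the fractional bits simply adding up (plain Python illustration, not the QuantizeML FixedPoint class):

a_int, a_frac = 25, 3    # 3.125 represented as 25 * 2**-3
b_int, b_frac = 6, 1     # 3.0 represented as 6 * 2**-1
p_int, p_frac = a_int * b_int, a_frac + b_frac    # integer-only product
assert p_int * 2**-p_frac == 3.125 * 3.0          # 150 * 2**-4 == 9.375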
Quantization flow
The first step in the workflow is to train a standard Keras model. This trained model is the starting point for the quantization stage. Once it is established that the overall model configuration prior to quantization yields a satisfactory performance on the task, one can proceed with quantization.
Let’s take the DS-CNN model from our zoo that targets the KWS (keyword spotting) task as an example:
from akida_models import fetch_file
from quantizeml.models import load_model
model_file = fetch_file("https://data.brainchip.com/models/AkidaV2/ds_cnn/ds_cnn_kws.h5",
fname="ds_cnn_kws.h5")
model = load_model(model_file)
The QuantizeML toolkit offers a turnkey solution to quantize a model: the quantize function. It replaces the Keras layers (or custom QuantizeML layers) with quantized, integer-only layers. The resulting quantized model is still a Keras model that can be evaluated with a standard Keras pipeline.
The quantization scheme used by quantize can be configured using QuantizationParams. If none is given, an 8-bit quantization scheme is selected by default.
Here’s an example for 8-bit quantization:
from quantizeml.layers import QuantizationParams
qparams8 = QuantizationParams(input_weight_bits=8, weight_bits=8, activation_bits=8)
Here’s an example for 4-bit quantization (with first layer weights set to 8-bit):
from quantizeml.layers import QuantizationParams
qparams4 = QuantizationParams(input_weight_bits=8, weight_bits=4, activation_bits=4)
Note that quantizing the first layer weights to 8-bit helps preserve accuracy.
QuantizeML uses a uniform quantization scheme centered on zero. During quantization, the floating point values are mapped to a quantization space of the form \(\left[-2^{bitwidth-1}, 2^{bitwidth-1}-1\right] \times scales\).
scales is a real number used to map the FixedPoint numbers to the quantization space. It is calculated from the range of the floating point values, so that the largest absolute value falls on the edge of the integer range.
Input, weight and output scales are folded into a single output scale vector.
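To make the scheme concrete, here is a minimal NumPy sketch of symmetric, zero-centered uniform quantization; it illustrates the principle only and is not the actual QuantizeML implementation (which also constrains values to the FixedPoint representation):

import numpy as np

def fake_quantize(values, bitwidth=8):
    int_max = 2**(bitwidth - 1) - 1
    # scale chosen so that the largest absolute value maps onto the integer range edge
    scales = np.max(np.abs(values)) / int_max
    values_int = np.clip(np.round(values / scales), -int_max - 1, int_max)
    return values_int * scales   # quantized values lie on a uniform grid centered on zero

print(fake_quantize(np.array([-1.7, -0.2, 0.0, 0.4, 1.3]), bitwidth=4))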
To avoid saturation in downstream operations throughout the model graph, the bitwidth of intermediate results is decreased using an OutputQuantizer. The quantize function has built-in rules to automatically identify the building blocks of layers after which such a reduction is required, and it inserts the OutputQuantizer objects during the quantization process.
To operate properly, an OutputQuantizer must be calibrated: calibration statistically determines an adequate quantization range from sample data. It is possible to pass samples down to the quantize function so that calibration and quantization are performed simultaneously.
Calibration samples are available on the Brainchip data server for the datasets used in our zoo. They must be downloaded and deserialized before being used for calibration.
import numpy as np
from akida_models import fetch_file
samples = fetch_file("https://data.brainchip.com/dataset-mirror/samples/kws/kws_batch1024.npz",
fname="kws_batch1024.npz")
samples = np.load(samples)
samples = np.concatenate([samples[item] for item in samples.files])
Quantizing the DS-CNN model to 8-bit is then done with:
from quantizeml.models import quantize
quantized_model = quantize(model, qparams=qparams8, samples=samples)
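Since the quantized model is still a Keras model, it can be evaluated with the usual Keras pipeline. A minimal sketch, assuming hypothetical x_test and y_test arrays holding preprocessed KWS samples and integer labels:

# Evaluate the quantized model like any other Keras model
quantized_model.compile(loss="sparse_categorical_crossentropy", metrics=["accuracy"])
quantized_model.evaluate(x_test, y_test, batch_size=128)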
Please refer to calibrate for more details on calibration.
Direct quantization of a standard Keras model (also called Post Training Quantization, PTQ) generally introduces a drop in performance. This drop is usually small for 8-bit or even 4-bit quantization of simple models, but it can be very significant for low quantization bitwidths and complex models (e.g. AkidaNet or transformer architectures).
If the quantized model offers acceptable performance, it can be directly converted into an Akida model (see the convert function).
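For illustration, conversion relies on the CNN2SNN convert function; a minimal sketch, assuming the cnn2snn package is installed:

from cnn2snn import convert

# Convert the quantized Keras model into an Akida model
akida_model = convert(quantized_model)
akida_model.summary()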
However, if the performance drop is too high, a Quantization Aware Training (QAT) step is required to recover the performance prior to quantization. Since the quantized model is a Keras model, it can then be trained using the standard Keras API.
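A minimal QAT sketch using the standard Keras API, with a hypothetical train_ds dataset and illustrative hyper-parameters (a low learning rate is typically used for fine-tuning):

from keras.optimizers import Adam

# Fine-tune the quantized model for a few epochs at a low learning rate
quantized_model.compile(optimizer=Adam(learning_rate=1e-4),
                        loss="sparse_categorical_crossentropy",
                        metrics=["accuracy"])
quantized_model.fit(train_ds, epochs=5)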
Check out the examples section for tutorials on quantization, PTQ and QAT.
Compatibility constraints
The toolkit supports a wide range of layers (see the supported layer types section). When hitting an incompatible layer, QuantizeML will simply stop the quantization before this layer and add a Dequantizer so that inference is still possible. When this occurs, a warning with the offending layer name is raised.
While quantization comes with some restrictions on layer order (e.g. the MaxPool2D operation should be placed before the ReLU activation), the sanitize helper is called before quantization to deal with such restrictions and edit the model accordingly. sanitize will also handle some layers that are not in the supported layer types, such as the following (a small illustration of the first Lambda substitution is given after this list):
- ZeroPadding2D, which is replaced with ‘same’ padding in the following convolution when possible
- Lambda layers:
  - Lambda(relu) or Activation(‘relu’) → ReLU,
  - Lambda(transpose) → Permute,
  - Lambda(reshape) → Reshape,
  - Lambda(add) → Add.
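As a pure Keras illustration of the Lambda(relu) case, the Lambda wrapper and the dedicated ReLU layer are functionally equivalent, which is why sanitize can safely rewrite one into the other:

import tensorflow as tf
from keras import layers

x = tf.constant([[-1.0, 2.0]])
# Both layers produce the same output on the same input
assert tf.reduce_all(layers.Lambda(tf.nn.relu)(x) == layers.ReLU()(x)).numpy()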
Model loading
The toolkit offers a keras.models.load_model wrapper that allows loading models with quantized layers: quantizeml.models.load_model.
Command line interface
In addition to the programming interface, the QuantizeML toolkit also provides a command-line interface to quantize a model, dump a quantized model configuration, check a quantized model and insert a rescaling layer.
quantize CLI
Quantizing a model through the CLI uses almost the same arguments as the programming interface, but the quantization parameters are split into separate options: input layer weight bitwidth with “-i”, weight bitwidth with “-w” and activation bitwidth with “-a”.
quantizeml quantize -m model_keras.h5 -i 8 -w 8 -a 8
Note that if no calibration options are explicitly given, calibration will use 1024 randomly generated samples. It is generally advised to use real samples serialized in a NumPy .npz file.
quantizeml quantize -m model_keras.h5 -i 8 -w 8 -a 8 -sa some_samples.npz -bs 128 -e 2
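The some_samples.npz file is simply a NumPy archive of preprocessed inputs; a minimal sketch to produce one, assuming calibration_samples is a hypothetical NumPy array of real samples (the archive key name is arbitrary, as all arrays in the file are concatenated at load time):

import numpy as np

# Serialize calibration samples so they can be passed to the -sa option
np.savez("some_samples.npz", samples=calibration_samples)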
For Akida 1.0 compatibility, activations must be quantized per-tensor instead of the default per-axis quantization:
quantizeml quantize -m model_keras.h5 -i 8 -w 4 -a 4 --per_tensor_activations
config CLI
Advanced users might want to customize the default quantization pattern. This is made possible by dumping a quantized model configuration to a .json file, editing it manually and quantizing again using the “-c” option.
quantizeml config -m model_keras_i8_w8_a8.h5 -o config.json
... manual configuration changes ...
quantizeml quantize -m model_keras.h5 -c config.json
Warning
Editing a model configuration can be complicated and might have negative effects on quantized accuracy or even on the model graph. This should be reserved for users deeply familiar with QuantizeML concepts.
check CLI
It is possible to check for quantization issues using the check CLI, which reports inaccurate weight scale quantization or saturation in integer operations.
quantizeml check -m model_keras_i8_w8_a8.h5
insert_rescaling CLI
Some models might not include a Rescaling layer in their architecture and instead rely on a separate preprocessing pipeline (i.e. moving from [0, 255] images to a [-1, 1] normalized representation). As having a Rescaling layer inside the model can be useful, QuantizeML offers the insert_rescaling CLI that adds a Rescaling layer at the beginning of a given model.
quantizeml insert_rescaling -m model_keras.h5 -s 0.007843 -o -1 -d model_updated.h5
where \(0.007843 = 1/127.5\).
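The inserted layer behaves like a Keras Rescaling layer computing x * scale + offset, so with these values [0, 255] inputs are mapped to [-1, 1]:

from keras.layers import Rescaling

# 255 * (1 / 127.5) - 1 = 1 and 0 * (1 / 127.5) - 1 = -1
rescaling = Rescaling(scale=1 / 127.5, offset=-1)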
Supported layer types
The QuantizeML toolkit provides quantization for the following layer types, which are standard Keras layers for the most part and custom QuantizeML layers for some of them:
- Neural layers
  - DepthwiseConv2DTranspose (custom QuantizeML layer)
- Transformers
  - Attention (custom QuantizeML layer)
  - ClassToken (custom QuantizeML layer)
  - AddPositionEmbs (custom QuantizeML layer)
  - ExtractToken (custom QuantizeML layer)
- Skip connections
- Normalization
  - LayerMadNormalization (custom QuantizeML layer)
- Pooling
[1] See https://en.wikipedia.org/wiki/Fixed-point_arithmetic for more details on the arithmetic.