Overview¶
In this notebook, we will go over the most important objects and classes in fastforward. At the end of the notebook, we will have covered how to quantize simple modules like a multi-layer perceptron as well as the large language model OPT. This is a great start if you want to familiarize yourself with fastforward.
The notebook consists of five sections:
- Quantized Tensors: QuantizedTensors, a subclass of torch.Tensor, are the fundamental datatype in FastForward.
- Quantizers: Quantizers are torch.nn.Modules that turn floating point tensors into QuantizedTensors and can learn from data.
- Quantized Modules: Quantizing a module consists of three steps: 1) changing the module to a QuantizedModule, 2) inserting the desired quantizers, and 3) estimating the ranges for each quantizer.
- Quantized Models: How to automate the first two steps above using 1) the quantize_model function and 2) the QuantizationConfig. This section also shows how to manually quantize custom and 3rd party modules.
- Quantizing 3rd Party Models: We show how we applied all of the above to quantize the Hugging Face OPT model.
import copy
from pprint import pprint
import torch
import fastforward as ff
1. Quantized Tensors¶
We start our tutorial with ff.quantized_tensor.QuantizedTensor. This datatype is a subclass of torch.Tensor designed to hold quantized data. A QuantizedTensor can hold data quantized with any type of quantization (uniform quantization, dynamic quantization, vector quantization, ...), but in this notebook we focus on linear/uniform quantization, i.e. integer quantization on a fixed per-tensor or per-channel grid.
You do not need to know the details of this (very common) quantization scheme, but if you want to learn more, please refer to A White Paper on Neural Network Quantization (Nagel et al., 2021).
⏩ Let's start by creating some floating point data
in_features = 4
data = torch.rand(1, in_features) - 0.5
data
tensor([[0.1579, 0.1376, 0.3846, 0.4860]])
⏩ Now, we quantize the data using 8-bit per-tensor quantization.
scale = torch.tensor([0.1])
num_bits = 8
quantized_data = ff.quantization.affine.quantize_per_tensor(data, num_bits=num_bits, scale=scale)
quantized_data
QuantizedTensor([[2., 1., 4., 5.]], quantizer=TiledAffineQuantizationFunction, scale=tensor([0.1000]), tile_size=torch.Size([1, 4]), num_bits=8, output_dtype=torch.float32, offset=None)
✅ We can see that quantized_data is now a QuantizedTensor. This makes it very easy to see in FastForward whether your data is actually quantized or not.
✅ This tensor holds both the actual data (of the same shape as data) and the hyperparameters of the quantizer. In this specific case the only hyperparameter is the quantization scale, which we set manually.
❌ Because the data has been transformed to a new coordinate system, it is not easy to see which floating point values it represents.
⏩ To recover those values, we can dequantize the tensor, which we do below.
quantized_data.dequantize()
tensor([[0.2000, 0.1000, 0.4000, 0.5000]])
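✅ The round-trip above can also be reproduced by hand, which makes the scheme concrete. Below is a minimal plain-PyTorch sketch of the symmetric per-tensor case (just the arithmetic being simulated, not the FastForward implementation):
x = torch.tensor([[0.1579, 0.1376, 0.3846, 0.4860]])
scale, num_bits = 0.1, 8
# Symmetric signed integer grid for 8 bits: [-128, 127]
qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
q = torch.clamp(torch.round(x / scale), qmin, qmax)  # values on the integer grid
x_hat = q * scale  # dequantized approximation
print(q)      # tensor([[2., 1., 4., 5.]])
print(x_hat)  # tensor([[0.2000, 0.1000, 0.4000, 0.5000]])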
2. Quantizers¶
In the previous section, we saw how a floating point tensor can be quantized.
❌ Quantization often involves hyperparameters that we do not know in advance.
✅ To handle this, we can use ff.nn.Quantizers. These are nn.Modules that quantize data in their forward pass and can also be used to estimate or learn the quantization hyperparameters.
⏩ We create a LinearQuantizer now.
quantizer = ff.nn.linear_quantizer.LinearQuantizer(num_bits=2)
quantizer
LinearQuantizer(num_bits=2, symmetric=True, granularity=PerTensor())
⏩ Next, we try to quantize our data with our quantizer.
try:
quantizer(data)
except ValueError as e:
print("[ERROR]", e, "\n")
print(f"{quantizer.has_uninitialized_params=}")
print(f"{quantizer.quantization_range=}") # min, max values that quantizer can represent.
[ERROR] Tried to quantize a tensor using an uninitialized quantizer (of type LinearQuantizer). This quantizer is initialized after its quantization_range is specified. This can be done explicitly by using the LinearQuantizer.quantization_range setter or using a range setting method. quantizer.has_uninitialized_params=True quantizer.quantization_range=(None, None)
❌ We can see that our quantizer cannot quantize any data just yet: this specific quantizer has hyperparameters that need to be fitted before it can quantize data, so its quantization range is not yet set.
⏩ We could set quantizer.quantization_range directly, but we would need to know the desired quantization range (min, max).
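⏩ For example, if we already knew a suitable range (say, the minimum and maximum of the batch), we could use the quantization_range setter mentioned in the error message above. A minimal sketch (manual_quantizer is just an illustrative name):
manual_quantizer = ff.nn.linear_quantizer.LinearQuantizer(num_bits=2)
# Set the representable (min, max) range directly instead of estimating it from data.
manual_quantizer.quantization_range = (data.min(), data.max())
print(f"{manual_quantizer.has_uninitialized_params=}")  # expected to be False once a range is set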
⏩ A more common approach is to use range estimation to find the optimal range based on data. We do this below.
with ff.range_setting.estimate_ranges(quantizer, ff.range_setting.smoothed_minmax):
quantizer(data)
print(f"{quantizer.has_uninitialized_params=}")
print(f"{quantizer.quantization_range=}")
print(f"{data.min()=} {data.max()=}")
quantizer.has_uninitialized_params=False quantizer.quantization_range=(tensor([0.], grad_fn=<MulBackward0>), tensor([0.4860], grad_fn=<MulBackward0>)) data.min()=tensor(0.1376) data.max()=tensor(0.4860)
✅ We have now set the quantization range and the quantizer is initialized.
✅ We can see that the quantization range has been estimated from the data batch; its maximum matches data.max().
⏩ We will now use our quantizer to quantize the data
quantized_data = quantizer(data) # type: ignore[assignment]
quantized_data
QuantizedTensor([[-1., -1., 0., 1.]], grad_fn=<AliasBackward0>, quantizer=TiledAffineQuantizationFunction, scale=tensor([0.1620], grad_fn=<TiledAffineQuantizationFunctionFunctionBackward>), tile_size=data_shape, num_bits=2, output_dtype=torch.float32, offset=tensor([2.], grad_fn=<TiledAffineQuantizationFunctionFunctionBackward>))
3. Quantized Modules¶
Quantizers typically don't exist in isolation; we usually want to quantize an entire model. In this section we show how to turn a module into a quantized module and what happens under the hood. In the next section we show how to use convenience methods to quantize bigger models more easily.
⏩ We start with a simple unquantized linear layer
out_features = 8
unquantized_linear = torch.nn.Linear(in_features, out_features)
print(unquantized_linear)
Linear(in_features=4, out_features=8, bias=True)
⏩ In FastForward we use ff.nn.QuantizedModules. They are drop-in replacements for torch.nn.Modules that additionally take care of quantization.
⏩ Most modules in torch.nn are mirrored by a quantized counterpart in ff.nn:
- These quantized modules are designed to behave exactly like their floating point counterparts: they have the same methods with the same function signatures.
- The only difference is that quantizer children are added to the modules and the forward pass is changed so that it performs quantized operations instead of floating point operations.
- ⚠️ If you do not find your desired module in ff.nn, you can either open an issue with us or implement the layer yourself.
⏩ Let's take a closer look at the ff.nn.QuantizedLinear below.
- For now, we manually copy the weight and bias data so that quantized_linear matches unquantized_linear. In the next section we will show convenience methods to automatically quantize modules in-place.
quantized_linear = ff.nn.QuantizedLinear(in_features, out_features)
quantized_linear.weight.data = unquantized_linear.weight.data.clone()
quantized_linear.bias.data = unquantized_linear.bias.data.clone()
print(quantized_linear)
QuantizedLinear( in_features=4, out_features=8, bias=True (input_quantizer): QuantizerStub() (weight_quantizer): QuantizerStub() (bias_quantizer): QuantizerStub() (output_quantizer): QuantizerStub() )
⏩ We can see that our QuantizedLinear has the same representation as the Linear, but with four quantizer children added.
- In this example we leave the bias_quantizer as a QuantizerStub, i.e. the bias is not quantized.
⏩ Observe that all quantizers are set to QuantizerStubs. These are no-op placeholders that can be replaced with quantizers if desired.
⏩ Let's try to push data through our QuantizedLinear.
try:
quantized_output = quantized_linear(data)
except ff.exceptions.QuantizationError as e:
print("[ERROR]", e, "\n")
[ERROR] Expected 'input' to be an instance of 'QuantizedTensor' because strict_quantization=True.
❌ We see that we cannot push data through our QuantizedLinear because strict_quantization=True.
- This flag catches a common error case in simulated quantization where no quantization takes place without the user being aware of it.
- In our case, we have not assigned any quantizers, so the layer would behave as a floating point layer, which is not allowed when strict_quantization=True.
⏩ Let's temporarily disable the strict_quantization setting and see what happens when we call the quantized_linear.
with ff.strict_quantization(False):
quantized_output = quantized_linear(data)
unquantized_output = unquantized_linear(data)
print(f"{unquantized_output=}")
print(f"{quantized_output=}")
unquantized_output=tensor([[-0.1949, -0.5414, -0.5109, 0.0402, -0.1622, 0.2810, 0.3713, -0.3943]], grad_fn=<AddmmBackward0>) quantized_output=tensor([[-0.1949, -0.5414, -0.5109, 0.0402, -0.1622, 0.2810, 0.3713, -0.3943]], grad_fn=<AddmmBackward0>)
✅ Indeed, quantized_linear behaves exactly like unquantized_linear, as we have not specified any quantizers.
⏩ We will now assign quantizers to the input, weight, and output quantizer fields.
quantized_linear.input_quantizer = ff.nn.linear_quantizer.LinearQuantizer(num_bits=2)
quantized_linear.weight_quantizer = ff.nn.linear_quantizer.LinearQuantizer(num_bits=2)
quantized_linear.output_quantizer = ff.nn.linear_quantizer.LinearQuantizer(num_bits=2)
print(quantized_linear)
QuantizedLinear( in_features=4, out_features=8, bias=True (input_quantizer): LinearQuantizer(num_bits=2, symmetric=True, granularity=PerTensor()) (weight_quantizer): LinearQuantizer(num_bits=2, symmetric=True, granularity=PerTensor()) (bias_quantizer): QuantizerStub() (output_quantizer): LinearQuantizer(num_bits=2, symmetric=True, granularity=PerTensor()) )
⏩ As we know from the example above, we first need to initialize the quantizers by passing data through them. Let's do so.
print("Before range estimation")
print(f"{quantized_linear.input_quantizer.quantization_range=}")
print(f"{quantized_linear.weight_quantizer.quantization_range=}")
print(f"{quantized_linear.output_quantizer.quantization_range=}")
print()
with ff.range_setting.estimate_ranges(quantized_linear, ff.range_setting.smoothed_minmax):
quantized_linear(data)
print("After range estimation")
print(f"{quantized_linear.input_quantizer.quantization_range=}")
print(f"{quantized_linear.weight_quantizer.quantization_range=}")
print(f"{quantized_linear.output_quantizer.quantization_range=}")
print()
Before range estimation quantized_linear.input_quantizer.quantization_range=(None, None) quantized_linear.weight_quantizer.quantization_range=(None, None) quantized_linear.output_quantizer.quantization_range=(None, None) After range estimation quantized_linear.input_quantizer.quantization_range=(tensor([0.], grad_fn=<MulBackward0>), tensor([0.4860], grad_fn=<MulBackward0>)) quantized_linear.weight_quantizer.quantization_range=(tensor([-0.8725], grad_fn=<MulBackward0>), tensor([0.4362], grad_fn=<MulBackward0>)) quantized_linear.output_quantizer.quantization_range=(tensor([-0.8864], grad_fn=<MulBackward0>), tensor([0.4432], grad_fn=<MulBackward0>))
✅ We can see that all the quantizers in our layer are initialized now.
⏩ We should now be able to call our quantized_linear. Let's do that!
unquantized_output = unquantized_linear(data)
quantized_output = quantized_linear(data)
print(f"{unquantized_output=}")
print()
print(f"{quantized_output=}")
print()
print(f"{quantized_output.dequantize()=}")
unquantized_output=tensor([[-0.1949, -0.5414, -0.5109, 0.0402, -0.1622, 0.2810, 0.3713, -0.3943]], grad_fn=<AddmmBackward0>) quantized_output=QuantizedTensor([[-1., -1., -1., 0., -0., 1., 1., -1.]], grad_fn=<AliasBackward0>, quantizer=TiledAffineQuantizationFunction, scale=tensor([0.4432], grad_fn=<TiledAffineQuantizationFunctionFunctionBackward>), tile_size=data_shape, num_bits=2, output_dtype=torch.float32, offset=tensor([0.], grad_fn=<TiledAffineQuantizationFunctionFunctionBackward>)) quantized_output.dequantize()=tensor([[-0.4432, -0.4432, -0.4432, 0.0000, 0.0000, 0.4432, 0.4432, -0.4432]], grad_fn=<STEAutogradFuncBackward>)
✅ We can now see that our quantized_linear is behaving as expected:
- The output is a QuantizedTensor.
- The dequantized output is close to the floating point output, but it is not identical due to quantization error (see the small check below).
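⏩ The size of that quantization error can be checked directly; here is a small sketch using only the dequantize() method shown earlier and plain tensor arithmetic:
error = (quantized_output.dequantize() - unquantized_output).abs()
print(f"max abs error: {error.max().item():.4f}, mean abs error: {error.mean().item():.4f}")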
4. Quantized Models¶
In the previous section we showed how to quantize a module:
1. Turn an unquantized module into a QuantizedModule.
2. Replace the desired QuantizerStubs with the desired Quantizers.
3. Estimate the quantizer ranges by passing data through the model.
Performing steps 1 and 2 was quite laborious in the example above. Since these steps have to be repeated for every layer in the model, we have created helper tools to automate them. Below we show how to use the quantize_model function to automatically replace all layers with their quantized counterparts (step 1) and how to use the QuantizationConfig to automatically insert quantizers into the model (step 2).
⏩ We start by making a simple unquantized MLP model.
hidden_features = 3
unquantized_model = torch.nn.Sequential(
torch.nn.Linear(in_features, hidden_features),
torch.nn.ReLU(),
torch.nn.Linear(hidden_features, hidden_features),
torch.nn.ReLU(),
torch.nn.Linear(hidden_features, out_features),
torch.nn.ReLU(),
)
unquantized_model
Sequential( (0): Linear(in_features=4, out_features=3, bias=True) (1): ReLU() (2): Linear(in_features=3, out_features=3, bias=True) (3): ReLU() (4): Linear(in_features=3, out_features=8, bias=True) (5): ReLU() )
4.1 Automatically replace torch.nn.Modules with their ff.nn.QuantizedModule counterparts¶
⏩ The quantize_model function changes a model in-place, recursively replacing all modules with their QuantizedModule counterparts.
- The way this function works is pretty simple: it uses a dict that maps torch.nn.Module types to ff.nn.QuantizedModule types.
⏩ Let's have a look at this dictionary
ff.quantized_module_map()
{torch.nn.modules.activation.ReLU: fastforward.nn.activations.QuantizedRelu, torch.nn.modules.activation.SiLU: fastforward.nn.activations.QuantizedSilu, torch.nn.modules.container.Sequential: fastforward.nn.container.QuantizedSequential, torch.nn.modules.container.ModuleList: fastforward.nn.container.QuantizedModuleList, torch.nn.modules.container.ModuleDict: fastforward.nn.container.QuantizedModuleDict, torch.nn.modules.container.ParameterList: fastforward.nn.container.QuantizedParameterList, torch.nn.modules.container.ParameterDict: fastforward.nn.container.QuantizedParameterDict, torch.nn.modules.conv.Conv2d: fastforward.nn.conv.QuantizedConv2d, torch.nn.modules.conv.Conv1d: fastforward.nn.conv.QuantizedConv1d, torch.nn.modules.sparse.Embedding: fastforward.nn.embedding.QuantizedEmbedding, torch.nn.modules.linear.Linear: fastforward.nn.linear.QuantizedLinear, torch.nn.modules.normalization.LayerNorm: fastforward.nn.normalization.QuantizedLayerNorm}
⏩ We will now run the quantize_model function.
⚠️ Note that ff.quantize_model changes the module classes in-place!
- Because we want to keep access to the full precision network for our comparison, we first deepcopy the floating point model.
quantized_model = copy.deepcopy(unquantized_model)
ff.quantize_model(quantized_model)
quantized_model
QuantizedSequential( (0): QuantizedLinear( in_features=4, out_features=3, bias=True (input_quantizer): QuantizerStub() (weight_quantizer): QuantizerStub() (bias_quantizer): QuantizerStub() (output_quantizer): QuantizerStub() ) (1): QuantizedRelu( (input_quantizer): QuantizerStub() (output_quantizer): QuantizerStub() ) (2): QuantizedLinear( in_features=3, out_features=3, bias=True (input_quantizer): QuantizerStub() (weight_quantizer): QuantizerStub() (bias_quantizer): QuantizerStub() (output_quantizer): QuantizerStub() ) (3): QuantizedRelu( (input_quantizer): QuantizerStub() (output_quantizer): QuantizerStub() ) (4): QuantizedLinear( in_features=3, out_features=8, bias=True (input_quantizer): QuantizerStub() (weight_quantizer): QuantizerStub() (bias_quantizer): QuantizerStub() (output_quantizer): QuantizerStub() ) (5): QuantizedRelu( (input_quantizer): QuantizerStub() (output_quantizer): QuantizerStub() ) )
✅ We see that all modules in the model are now replaced with their quantized counterpart.
⏩ Since no quantizers are inserted yet, let's confirm that the quantized_model still behaves the same as the unquantized_model.
with ff.strict_quantization(False):
quantized_output = quantized_model(data)
unquantized_output = unquantized_model(data)
print(f"{unquantized_output=}")
print(f"{quantized_output=}")
unquantized_output=tensor([[0.0795, 0.0000, 0.4139, 0.0000, 0.4434, 0.0000, 0.0000, 0.0938]], grad_fn=<ReluBackward0>) quantized_output=tensor([[0.0795, 0.0000, 0.4139, 0.0000, 0.4434, 0.0000, 0.0000, 0.0938]], grad_fn=<ReluBackward0>)
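⏩ Instead of eyeballing the printed tensors, we can also verify the match programmatically. A small sketch using torch.testing:
# With only QuantizerStubs in place, both models should produce the same output.
torch.testing.assert_close(quantized_output, unquantized_output)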
4.2 Automatically inserting ff.nn.Quantizers in the right places in the model¶
The ff.QuantizationConfig is a tool to automatically replace QuantizerStubs with Quantizers. It works by adding quantization rules.
A quantization rule consists of two components:
- A query that determines to which layers the rule applies. Filtering is done using the ff.mpath library; please see the tutorial on MPath for more information.
- A quantizer class or factory, which determines how the quantizer is created. In the case of a quantizer class, a quantizer of that class is initialized with the provided kwargs for each match. In the case of a factory function, the function receives the full name of the quantizer and the current quantizer at that location, and is expected to return an initialized quantizer.
If multiple rules match a single quantizer, the rule that was added last takes priority.
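⏩ To illustrate this priority, here is a small sketch with two overlapping rules (the queries and bit widths are illustrative; the add_rule API is the same one used in the next cell):
example_config = ff.QuantizationConfig()
# Rule 1: quantize every weight quantizer in the model with 8 bits.
example_config.add_rule("**/[quantizer:parameter/weight]", ff.nn.LinearQuantizer, num_bits=8)
# Rule 2: matches only the weight quantizer of module "0". Because it was added last,
# it takes priority there, so that weight is quantized with 4 bits instead.
example_config.add_rule("0/[quantizer:parameter/weight]", ff.nn.LinearQuantizer, num_bits=4)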
⏩ We create our QuantizationConfig below; see if you can understand all the rules!
config = ff.QuantizationConfig()
# We want to quantize all weights in the model.
config.add_rule(
"**/[quantizer:parameter/weight]",
ff.nn.LinearQuantizer,
num_bits=8,
symmetric=True,
granularity=ff.PerChannel(),
)
# We want to quantize all the outputs in the model, too.
config.add_rule(
"**/[quantizer:activation/output]",
ff.nn.LinearQuantizer,
num_bits=8,
symmetric=False,
granularity=ff.PerTensor(),
)
# We only want to enable the input quantizer of the first layer, so that we can turn a floating point input into a quantized input.
# For the subsequent layers, the input will already be quantized because there will be an output quantizer in the layer before that.
def input_factory(name: str, current_quantizer: ff.nn.Quantizer) -> ff.nn.Quantizer:
return ff.nn.LinearQuantizer(num_bits=8, symmetric=False, granularity=ff.PerTensor())
config.add_rule("0/[quantizer:activation/input]", input_factory)
config
QuantizationConfig( **/[quantizer:parameter/weight] => LinearQuantizer(num_bits=8, symmetric=True, granularity=PerChannel(channel=0)), **/[quantizer:activation/output] => LinearQuantizer(num_bits=8, symmetric=False, granularity=PerTensor()), 0/[quantizer:activation/input] => input_factory(<module_name>, <current_quantizer>) )
✅ Note that in the rule for the input quantizer we could have directly specified the quantizer class, but we instead show an example of using a factory function. The factory receives both the name and the current quantizer at the given location, which can be an initialized quantizer or a QuantizerStub.
⏩ Applying the QuantizationConfig to the model is very simple:
config.initialize(quantized_model)
quantized_model
QuantizedSequential( (0): QuantizedLinear( in_features=4, out_features=3, bias=True (input_quantizer): LinearQuantizer(num_bits=8, symmetric=False, granularity=PerTensor()) (weight_quantizer): LinearQuantizer(num_bits=8, symmetric=True, granularity=PerChannel(channel=0)) (bias_quantizer): QuantizerStub() (output_quantizer): LinearQuantizer(num_bits=8, symmetric=False, granularity=PerTensor()) ) (1): QuantizedRelu( (input_quantizer): QuantizerStub() (output_quantizer): LinearQuantizer(num_bits=8, symmetric=False, granularity=PerTensor()) ) (2): QuantizedLinear( in_features=3, out_features=3, bias=True (input_quantizer): QuantizerStub() (weight_quantizer): LinearQuantizer(num_bits=8, symmetric=True, granularity=PerChannel(channel=0)) (bias_quantizer): QuantizerStub() (output_quantizer): LinearQuantizer(num_bits=8, symmetric=False, granularity=PerTensor()) ) (3): QuantizedRelu( (input_quantizer): QuantizerStub() (output_quantizer): LinearQuantizer(num_bits=8, symmetric=False, granularity=PerTensor()) ) (4): QuantizedLinear( in_features=3, out_features=8, bias=True (input_quantizer): QuantizerStub() (weight_quantizer): LinearQuantizer(num_bits=8, symmetric=True, granularity=PerChannel(channel=0)) (bias_quantizer): QuantizerStub() (output_quantizer): LinearQuantizer(num_bits=8, symmetric=False, granularity=PerTensor()) ) (5): QuantizedRelu( (input_quantizer): QuantizerStub() (output_quantizer): LinearQuantizer(num_bits=8, symmetric=False, granularity=PerTensor()) ) )
✅ Observe that the quantizers in our quantized model are now set up as expected.
⏩ All we have to do now is estimate the ranges for the quantizers, and we can use the quantized model!
with ff.range_setting.estimate_ranges(quantized_model, ff.range_setting.smoothed_minmax):
quantized_model(data)
quantized_model(data)
QuantizedTensor([[ -83., -128., 110., -128., 127., -128., -128., -75.]], grad_fn=<AliasBackward0>, quantizer=TiledAffineQuantizationFunction, scale=tensor([0.0017], grad_fn=<TiledAffineQuantizationFunctionFunctionBackward>), tile_size=data_shape, num_bits=8, output_dtype=torch.float32, offset=tensor([128.], grad_fn=<TiledAffineQuantizationFunctionFunctionBackward>))
4.3 Quantizing Custom Modules: Manual Quantization¶
Your model might not consist only of torch.nn.Modules; it may also contain 3rd party or custom modules. Because FastForward does not have fully automated quantization yet, converting these modules using quantize_model does not work out of the box. Let us build such a custom module:
class MySelfAttentionLayer(torch.nn.Module):
def __init__(self, feature_size):
print("Calling MySelfAttentionLayer.__init__")
super().__init__()
self.feature_size = feature_size
# Linear transformations for Q, K, V from the same source
self.key = torch.nn.Linear(feature_size, feature_size)
self.query = torch.nn.Linear(feature_size, feature_size)
self.value = torch.nn.Linear(feature_size, feature_size)
def forward(self, x):
print("Calling MySelfAttentionLayer.forward")
# Apply linear transformations
keys = self.key(x)
queries = self.query(x)
values = self.value(x)
# Scaled dot-product attention
scores = torch.matmul(queries, keys.transpose(-2, -1))
scores = scores / torch.sqrt(torch.tensor(self.feature_size, dtype=torch.float32))
# Apply softmax
attention_weights = torch.nn.functional.softmax(scores, dim=-1)
# Multiply weights with values
output = torch.matmul(attention_weights, values)
return output, attention_weights
num_features = 8
my_unquantized_layer = MySelfAttentionLayer(num_features)
my_unquantized_layer
Calling MySelfAttentionLayer.__init__
MySelfAttentionLayer( (key): Linear(in_features=8, out_features=8, bias=True) (query): Linear(in_features=8, out_features=8, bias=True) (value): Linear(in_features=8, out_features=8, bias=True) )
my_quantized_layer = copy.deepcopy(my_unquantized_layer)
try:
ff.quantize_model(my_quantized_layer)
except ff.exceptions.QuantizationError as e:
print("[ERROR]", e, "\n")
print("ff.quantized_module_map():")
pprint(ff.quantized_module_map())
[ERROR] Cannot quantize model because no quantized version of the following modules is known: - __main__.MySelfAttentionLayer It is possible that quantized definitions of one or more of these models exists, but have not been imported." ff.quantized_module_map(): {<class 'torch.nn.modules.conv.Conv1d'>: <class 'fastforward.nn.conv.QuantizedConv1d'>, <class 'torch.nn.modules.conv.Conv2d'>: <class 'fastforward.nn.conv.QuantizedConv2d'>, <class 'torch.nn.modules.linear.Linear'>: <class 'fastforward.nn.linear.QuantizedLinear'>, <class 'torch.nn.modules.activation.ReLU'>: <class 'fastforward.nn.activations.QuantizedRelu'>, <class 'torch.nn.modules.activation.SiLU'>: <class 'fastforward.nn.activations.QuantizedSilu'>, <class 'torch.nn.modules.container.Sequential'>: <class 'fastforward.nn.container.QuantizedSequential'>, <class 'torch.nn.modules.container.ModuleList'>: <class 'fastforward.nn.container.QuantizedModuleList'>, <class 'torch.nn.modules.container.ModuleDict'>: <class 'fastforward.nn.container.QuantizedModuleDict'>, <class 'torch.nn.modules.container.ParameterList'>: <class 'fastforward.nn.container.QuantizedParameterList'>, <class 'torch.nn.modules.container.ParameterDict'>: <class 'fastforward.nn.container.QuantizedParameterDict'>, <class 'torch.nn.modules.normalization.LayerNorm'>: <class 'fastforward.nn.normalization.QuantizedLayerNorm'>, <class 'torch.nn.modules.sparse.Embedding'>: <class 'fastforward.nn.embedding.QuantizedEmbedding'>}
❌ Observe that ff.quantize_model does not work, because it does not know which class MySelfAttentionLayer should be mapped to.
⏩ For now, we have to manually define the quantized equivalent of MySelfAttentionLayer; we show how to do so below:
class MyQuantizedSelfAttentionLayer(MySelfAttentionLayer, ff.nn.quantized_module.QuantizedModule):
def __init_quantization__(self) -> None:
print("Calling MyQuantizedSelfAttentionLayer.__init_quantization__")
super().__init_quantization__()
self.attention_scores_output_quantizer = ff.nn.QuantizerStub(output_quantizer=True)
self.attention_weights_output_quantizer = ff.nn.QuantizerStub(output_quantizer=True)
self.attention_features_output_quantizer = ff.nn.QuantizerStub(output_quantizer=True)
# This function is only wrapped for demonstration purposes
def quantize_children(self, *args, **kwargs) -> None:
print("Calling MyQuantizedSelfAttentionLayer.quantize_children")
super().quantize_children(*args, **kwargs)
def forward(self, x):
print("Calling MyQuantizedSelfAttentionLayer.forward")
# Apply linear transformations
keys = self.key(x)
queries = self.query(x)
values = self.value(x)
# Scaled dot-product attention
scores = ff.nn.functional.matmul(
queries,
keys.transpose(-2, -1),
output_quantizer=self.attention_scores_output_quantizer,
)
scores = scores / torch.sqrt(torch.tensor(self.feature_size, dtype=torch.float32))
# Apply softmax
attention_weights = ff.nn.functional.softmax(
scores, dim=-1, output_quantizer=self.attention_weights_output_quantizer
)
# Multiply weights with values
output = ff.nn.functional.matmul(
attention_weights,
values,
output_quantizer=self.attention_features_output_quantizer,
)
return output, attention_weights
⏩ Notice that we made two changes to the model:
1. We re-implemented the forward pass, replacing all operations from torch.nn.functional with their FastForward quantized equivalents.
   - ❌ Until autoquant is implemented in FastForward, this means we need to manually duplicate the code of the forward pass.
   - ⚠️ NOTE: Some of the functionals might be hidden inside a function that is called from your forward pass; make sure to also rewrite those cases.
   - ⚠️ If you are adopting a 3rd party class, you will need to copy-paste the code of the forward pass. Make sure to also freeze the dependency so that your rewritten module does not diverge once the package is updated!
2. In order to use the quantized functionals, we added quantizers to the model: we added an __init_quantization__ method that adds the QuantizerStubs which can later be used for quantization.
   - ✅ We do not have to copy-paste any code from the __init__ function.
   - ✅ As we will see below, __init_quantization__ can be used both for initializing a QuantizedModule from scratch and for converting a Module to a QuantizedModule.
⏩ Let's have a look at how our MyQuantizedSelfAttentionLayer behaves when initialized from scratch:
new_quantized_layer = MyQuantizedSelfAttentionLayer(num_features)
new_quantized_layer
Calling MySelfAttentionLayer.__init__ Calling MyQuantizedSelfAttentionLayer.__init_quantization__
MyQuantizedSelfAttentionLayer( (key): Linear(in_features=8, out_features=8, bias=True) (query): Linear(in_features=8, out_features=8, bias=True) (value): Linear(in_features=8, out_features=8, bias=True) (attention_scores_output_quantizer): QuantizerStub() (attention_weights_output_quantizer): QuantizerStub() (attention_features_output_quantizer): QuantizerStub() )
⏩ Observe that:
- MySelfAttentionLayer.__init__ is called first, initializing the layer using the logic of the unquantized base layer.
- MyQuantizedSelfAttentionLayer.__init_quantization__ is then called, inserting the quantizer stubs.
- The child modules are not converted to their quantized counterparts when initializing from scratch.
⏩ In practice, we typically do not initialize quantized modules from scratch; rather, we take a floating point model and recursively convert all of its submodules.
⏩ We will now look at how MyQuantizedSelfAttentionLayer behaves when converted using the quantize_model function. First, let's look at the quantized_module_map:
print("ff.quantized_module_map():")
pprint(ff.quantized_module_map()[MySelfAttentionLayer])
ff.quantized_module_map(): <class '__main__.MyQuantizedSelfAttentionLayer'>
✅ Note that MySelfAttentionLayer automatically appeared in the quantized_module_map!
⚠️ All subclasses of QuantizedModule are automatically picked up by fastforward.nn.quantized_module.quantized_module_map(), but this requires the classes to be imported. If your class does not show up, make sure to import it, or use the extra_conversion argument if you want to override any mappings in the quantized_module_map.
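⏩ For example, if the automatic lookup does not pick up your class, you can pass the mapping explicitly. A sketch, assuming extra_conversion is accepted by ff.quantize_model as a mapping from unquantized to quantized classes (extra_layer is just an illustrative name):
# Assumption: extra_conversion overrides/extends the automatic quantized_module_map.
extra_layer = copy.deepcopy(my_unquantized_layer)
ff.quantize_model(
    extra_layer,
    extra_conversion={MySelfAttentionLayer: MyQuantizedSelfAttentionLayer},
)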
⏩ We will now look at how MyQuantizedSelfAttentionLayer behaves when converted using the quantize_model function.
my_quantized_layer = copy.deepcopy(my_unquantized_layer)
ff.quantize_model(my_quantized_layer)
my_quantized_layer
Calling MyQuantizedSelfAttentionLayer.__init_quantization__ Calling MyQuantizedSelfAttentionLayer.quantize_children
MyQuantizedSelfAttentionLayer( (key): QuantizedLinear( in_features=8, out_features=8, bias=True (input_quantizer): QuantizerStub() (weight_quantizer): QuantizerStub() (bias_quantizer): QuantizerStub() (output_quantizer): QuantizerStub() ) (query): QuantizedLinear( in_features=8, out_features=8, bias=True (input_quantizer): QuantizerStub() (weight_quantizer): QuantizerStub() (bias_quantizer): QuantizerStub() (output_quantizer): QuantizerStub() ) (value): QuantizedLinear( in_features=8, out_features=8, bias=True (input_quantizer): QuantizerStub() (weight_quantizer): QuantizerStub() (bias_quantizer): QuantizerStub() (output_quantizer): QuantizerStub() ) (attention_scores_output_quantizer): QuantizerStub() (attention_weights_output_quantizer): QuantizerStub() (attention_features_output_quantizer): QuantizerStub() )
⏩ Observe that:
- Since we convert an existing layer, MySelfAttentionLayer.__init__ is not called again.
- The class of our module is changed from MySelfAttentionLayer to MyQuantizedSelfAttentionLayer.
- MyQuantizedSelfAttentionLayer.__init_quantization__ is still called, inserting the quantizer stubs into the previously unquantized layer.
- The child modules are also converted to their quantized counterparts by calling MyQuantizedSelfAttentionLayer.quantize_children.
5. Quantizing 3rd party models (Huggingface OPT)¶
Based on the tutorial above, you should be able to manually quantize any model. We will now show how we quantized the OPT model in our fast-models benchmark repository.
The process of adopting the model consists of the following steps (which are explained in the notebook above):
1. Downloading the existing model code from the Hugging Face library.
   - ⚠️ Because we both use Hugging Face as a library and copy-paste Hugging Face code, we need to freeze our Hugging Face dependency so that it matches the version we copied the code from.
   - ⏩ Have a look at this commit to see how we conducted this step.
2. Cleaning the existing model code.
   - We remove everything except the modules that are used in the (OPT) model we aim to quantize.
   - Of those modules, we only keep the forward pass.
   - ⏩ Have a look at this commit to see how we conducted this step.
3. Modifying the existing model code.
   - We change all the functionals in the forward pass to their quantized counterparts.
   - ⚠️ NOTE: Sometimes the functionals are hidden inside a function (such as the quantized_masked_attention function in the OPT example); take care to also detect and convert those.
   - We add an __init_quantization__ method that adds the required quantizers used by the quantized functionals.
   - ⏩ Have a look at this commit to see how we conducted this step.
4. Adding code to insert the QuantizerStubs by adding a QuantizationConfig.
   - We make a QuantizationConfig that determines where to insert quantizers based on our experiment settings.
   - ⏩ Have a look at this commit to see how we conducted this step.
5. Running the full benchmark experiments.
   - ⏩ Have a look at this commit to see how we conducted this step.
Copyright (c) 2024 Qualcomm Technologies, Inc. All Rights Reserved.