FastForward Autoquant Tutorial#
ff.autoquantize (Autoquant) automatically rewrites a PyTorch module into a
quantized counterpart by analysing the module's forward method and any
helper functions it calls. The generated code inserts QuantizerStub
placeholders at every point where a tensor value is produced or consumed, so
you can later attach real quantizers and calibrate them without touching the
generated file by hand.
This tutorial walks through the most important features of Autoquant with self-contained examples.
Basic Usage#
The simplest way to use Autoquant is to call ff.autoquantize on any
torch.nn.Module. Because FastForward already knows how to quantize the
standard torch.nn.Linear layer, the example below subclasses it and
overrides forward so that Autoquant has something to analyse.
import torch
import torch.nn.functional as F
import fastforward as ff
# Creating a custom linear layer because
# autoquant ignores known modules such
# as torch.nn.Linear
class MyLinear(torch.nn.Linear):
def forward(self, input: torch.Tensor) -> torch.Tensor:
return F.linear(input, self.weight, self.bias)
# Initialize model and call autoquant
model = MyLinear(10, 10)
ff.autoquantize(
model,
output_path="quantized.py", # Write the output to quantized.py
use_type_inference=False, # Do not use type inference
force_overwrite=True,
)
Autoquant generates a new class that inherits from both
fastforward.nn.QuantizedModule and the original module. The
__init_quantization__ method is where all QuantizerStub instances are
created - one for each tensor-valued input, weight, bias, and output. In
forward, every tensor is passed through its corresponding stub before being
used in the computation.
import fastforward
import torch
from __main__ import MyLinear
class QuantizedMyLinear(fastforward.nn.QuantizedModule, MyLinear):
def __init_quantization__(self) -> None:
super().__init_quantization__()
self.quantizer_linear: ... = fastforward.nn.QuantizerStub()
self.quantizer_self_bias: ... = fastforward.nn.QuantizerStub()
self.quantizer_self_weight: ... = fastforward.nn.QuantizerStub()
self.quantizer_input: ... = fastforward.nn.QuantizerStub()
def forward(self, input: torch.Tensor) -> torch.Tensor:
input = self.quantizer_input(input)
self_bias = self.quantizer_self_bias(self.bias)
self_weight = self.quantizer_self_weight(self.weight)
return fastforward.nn.functional.linear(
input, self_weight, self_bias, output_quantizer=self.quantizer_linear
)
Helper functions#
Modules often delegate work to standalone helper functions. Autoquant follows those call chains and generates a quantized version of every helper it encounters, threading the required quantizer arguments through each call site automatically.
import torch
import fastforward as ff
# helper function that uses a helper
# itself
def helper(x, y):
return inner_helper(x, y) + inner_helper(y, x)
def inner_helper(x, y):
return x - y
class MyModuleWithHelper(torch.nn.Module):
def forward(self, x: torch.Tensor) -> torch.Tensor:
return helper(x, y)
model = MyModuleWithHelper()
ff.autoquantize(model, output_path="quantized.py", use_type_inference=False, force_overwrite=True)
Each helper gets its own quantized variant. Notice that quantized_helper
receives all quantizers for both of its quantized_inner_helper calls as
keyword-only arguments. Autoquant names them to avoid collisions (e.g.
quantizer_inner_helper_sub_1 vs quantizer_inner_helper_sub_2).
def quantized_helper(
x, y, *,
quantizer_add: fastforward.nn.Quantizer,
quantizer__tmp_1: fastforward.nn.Quantizer,
quantizer__tmp_2: fastforward.nn.Quantizer,
[...]
quantizer_inner_helper_y_2: fastforward.nn.Quantizer,
):
_tmp_1 = quantized_inner_helper(x, y,
quantizer_sub=quantizer_inner_helper_sub_1,
quantizer_x=quantizer_inner_helper_x_1,
quantizer_y=quantizer_inner_helper_y_1,
)
_tmp_1 = quantizer__tmp_1(_tmp_1)
_tmp_2 = quantized_inner_helper(y, x,
quantizer_sub=quantizer_inner_helper_sub_2,
quantizer_x=quantizer_inner_helper_x_2,
quantizer_y=quantizer_inner_helper_y_2,
)
_tmp_2 = quantizer__tmp_2(_tmp_2)
return fastforward.nn.functional.add(_tmp_1, _tmp_2,
output_quantizer=quantizer_add
)
def quantized_inner_helper(
x,
y,
*,
quantizer_sub: ...,
quantizer_x: ...,
quantizer_y: ...,
):
x = quantizer_x(x)
y = quantizer_y(y)
return fastforward.nn.functional.sub(x, y, output_quantizer=quantizer_sub)
Annotations#
Type annotations on helper functions give Autoquant the information it needs to decide which arguments are tensors (and therefore need quantizers) and which are plain Python values (and can be left alone). Without annotations, Autoquant conservatively treats every argument as a potential tensor.
import torch
import torch.nn.functional as F
import fastforward as ff
# helper function called from module
def helper(x, y):
y = y + y
for y_ in y:
x = x + y_
return x
class MyModuleWithHelper(torch.nn.Module):
def forward(self, x: torch.Tensor) -> torch.Tensor:
return helper(x, [2])
model = MyModuleWithHelper()
ff.autoquantize(model, output_path="quantized.py", use_type_inference=False, force_overwrite=True)
Without annotations and type inference enabled, Autoquant will assume
everything can be a tensor. This leads to unnecessary quantizers being
inserted for the list[int] argument y and the loop variable y_.
def quantized_helper(
x: torch.Tensor,
y: list[int],
*,
quantizer_add_1: fastforward.nn.Quantizer,
quantizer_add_2: fastforward.nn.Quantizer,
quantizer_y_: fastforward.nn.Quantizer,
quantizer_x: fastforward.nn.Quantizer,
quantizer_y: fastforward.nn.Quantizer,
):
x = quantizer_x(x)
y = quantizer_y(y)
y = fastforward.nn.functional.add(y, y, output_quantizer=quantizer_add_1)
for y_ in y:
y_ = quantizer_y_(y_)
x = fastforward.nn.functional.add(x, y_, output_quantizer=quantizer_add_2)
return x
Adding a type annotation y: list[int] tells Autoquant that y is not a
tensor. Enabling use_type_inference=True lets Autoquant propagate that
information through the function body.
import torch
import torch.nn.functional as F
import fastforward as ff
# helper function called from module
def helper(x: torch.Tensor, y: list[int]):
y = y + y
for y_ in y:
x = x + y_
return x
class MyModuleWithHelper(torch.nn.Module):
def forward(self, x: torch.Tensor) -> torch.Tensor:
return helper(x, [2])
model = MyModuleWithHelper()
ff.autoquantize(model, output_path="quantized.py", use_type_inference=True, force_overwrite=True)
With the annotation in place, Autoquant correctly infers that y and y_
are integers. The redundant quantizers for y and y_ disappear, and the
y + y list concatenation is left as plain Python. Annotations can
dramatically improve the result of Autoquant.
def quantized_helper(
x: torch.Tensor,
y: list[int],
*,
quantizer_add: fastforward.nn.Quantizer,
quantizer_y_: fastforward.nn.Quantizer,
quantizer_x: fastforward.nn.Quantizer,
):
x = quantizer_x(x)
y = y + y
for y_ in y:
y_ = quantizer_y_(y_)
x = fastforward.nn.functional.add(x, y_, output_quantizer=quantizer_add)
return x
Operator Table#
Autoquant relies on an internal operator table to know which PyTorch operators have quantized equivalents in FastForward. When an operator is missing from the table, Autoquant cannot quantize it and falls back to the original call.
import torch
import torch.nn.functional as F
import fastforward as ff
# This example seems similar to
# the first Linear example
class MyBilinear(torch.nn.Bilinear):
def forward(self, input1: torch.Tensor, input2: torch.Tensor) -> torch.Tensor:
return F.bilinear(input1, input2, self.weight, self.bias)
model = MyBilinear(10, 10, 10)
ff.autoquantize(model, output_path="quantized.py", use_type_inference=False, force_overwrite=True)
Because F.bilinear is not yet in FastForward's operator table, Autoquant
cannot produce a quantized version of the forward method. The generated
class is essentially a no-op wrapper that calls the original implementation
unchanged.
import fastforward
import torch
import torch.nn.functional as F
from __main__ import MyBilinear
class QuantizedMyBilinear(fastforward.nn.QuantizedModule, MyBilinear):
def __init_quantization__(self) -> None:
super().__init_quantization__()
def forward(self, input1: torch.Tensor, input2: torch.Tensor) -> torch.Tensor:
return F.bilinear(input1, input2, self.weight, self.bias)
You can extend the operator table at runtime by creating a custom
OperatorTable, defining the quantization schema for the missing operator,
and passing the table to ff.autoquantize. The schema string uses a
mini-language that describes which arguments are Quantized,
MaybeQuantized, or plain Python values.
import fastforward as ff
from fastforward.autoquant import OperatorTable, optable
def my_quantized_bilinear(
input1: torch.Tensor,
input2: torch.Tensor,
weight: torch.Tensor,
bias: torch.Tensor | None = None,
) -> torch.Tensor: ...
custom_optable = OperatorTable.from_yaml(alias_extensions=optable.STR_ALIASES_EXTENSIONS)
custom_schema = (
"bilinear(input1: Quantized, input2: Quantized, "
"weight: Quantized, bias: Optional[MaybeQuantized] = None) -> Quantized"
)
custom_optable.add(custom_schema, torch.nn.functional.bilinear, dispatch_op=my_quantized_bilinear)
ff.autoquantize(
model,
output_path="quantized.py",
use_type_inference=False,
force_overwrite=True,
operator_table=custom_optable,
)
With the custom operator registered, Autoquant can now fully quantize
MyBilinear. The generated forward method quantizes all inputs and
weights before dispatching to my_quantized_bilinear.
import fastforward
import torch
from __main__ import MyBilinear
class QuantizedMyBilinear(fastforward.nn.QuantizedModule, MyBilinear):
def __init_quantization__(self) -> None:
super().__init_quantization__()
self.quantizer_bilinear: ... = fastforward.nn.QuantizerStub()
[...]
self.quantizer_input2: ... = fastforward.nn.QuantizerStub()
def forward(self, input1: torch.Tensor, input2: torch.Tensor) -> torch.Tensor:
input1 = self.quantizer_input1(input1)
input2 = self.quantizer_input2(input2)
self_bias = self.quantizer_self_bias(self.bias)
self_weight = self.quantizer_self_weight(self.weight)
return __main__.my_quantized_bilinear(
input1, input2, self_weight, self_bias, output_quantizer=self.quantizer_bilinear
)
Replacement Patterns#
Sometimes you want Autoquant to route calls through your own operator
implementations rather than FastForward's built-ins. PatternRule lets you
express a simple string-template rewrite: any call matching the pattern is
rewritten to the replacement before the rest of Autoquant runs.
import syrupy
import fastforward as ff
import fastforward.autoquant as autoquant
from tests.autoquant.helpers import autoquant_with_defaults, codeformat_with_defaults
class ExampleModule1B(torch.nn.Module):
def forward(self, x: torch.Tensor) -> torch.Tensor:
y = torch.nn.functional.conv2d(x, x)
return torch.nn.functional.linear(y, y)
def test_pattern_based_replacement(snapshot: syrupy.assertion.SnapshotAssertion) -> None:
module = ExampleModule1B()
rule = ff.autoquant.PatternRule.from_str(
pattern="torch.nn.functional.{func}({a}, {b})",
replacement="my_own.{func}({a}, {b})",
)
quantized = autoquant_with_defaults(
module, use_type_inference=False, replacement_patterns=[rule]
)
quantized = codeformat_with_defaults(quantized)
assert snapshot == quantized
The pattern torch.nn.functional.{func}({a}, {b}) matches both conv2d and
linear calls. Autoquant rewrites them to my_own.conv2d and
my_own.linear respectively before generating the quantized module. Note
that because the replacement targets are unknown to Autoquant's operator
table, no quantizer stubs are inserted — the rewritten calls are emitted
verbatim.
import fastforward
import torch
from tests.autoquant.test_autoquant import ExampleModule1B
class QuantizedExampleModule1B(fastforward.nn.QuantizedModule, ExampleModule1B):
def __init_quantization__(self) -> None:
super().__init_quantization__()
def forward(self, x: torch.Tensor) -> torch.Tensor:
y = my_own.conv2d(x, x)
return my_own.linear(y, y)
Conclusion#
Autoquant removes the bulk of the manual work involved in writing quantized PyTorch modules. The key ideas to take away are:
-
Subclass, don't replace. The generated class inherits from both
fastforward.nn.QuantizedModuleand the original module, so it can be dropped in wherever the original was used. -
Helpers are handled automatically. Any standalone function called from
forwardis traced and rewritten alongside the module. Quantizer arguments are threaded through the call chain for you. -
Annotations pay off. Adding type annotations to helper functions — especially for non-tensor arguments — significantly reduces the number of unnecessary quantizers in the output. Pair annotations with
use_type_inference=Truefor the best results. -
Extend the operator table when needed. If Autoquant silently skips an operator, it is almost certainly missing from FastForward's operator table. You can register a custom schema and dispatch function at runtime without waiting for an upstream change.
-
Replacement patterns for custom dispatch.
PatternRulelets you redirect whole families of calls to your own implementations before Autoquant runs, giving you fine-grained control over the generated code.
With these tools combined, Autoquant can handle everything from simple linear layers to complex modules with loops, conditionals, and third-party operators.