View the runnable example on GitHub
Quantize PyTorch Model in INT8 for Inference using OpenVINO Post-training Optimization Tools
Post-training Optimization Tools (POT) is provided by the OpenVINO toolkit, so using POT for INT8 quantization also enables OpenVINO acceleration. To use POT on a PyTorch nn.Module, call the InferenceOptimizer.quantize API with accelerator='openvino' (and precision='int8'). It only takes a few lines.
Let’s take a ResNet-18 model pretrained on the ImageNet dataset and finetuned on the OxfordIIITPet dataset as an example:
from torchvision.models import resnet18
model = resnet18(pretrained=True)
_, train_dataset, val_dataset = finetune_pet_dataset(model)
The full definition of the finetune_pet_dataset function can be found in the runnable example.
To enable INT8 quantization using POT for inference, import InferenceOptimizer from BigDL-Nano and use it to quantize your PyTorch model with accelerator='openvino':
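from torch.utils.data import DataLoader
from bigdl.nano.pytorch import InferenceOptimizer

# calibration data is required for static post-training quantization
q_model = InferenceOptimizer.quantize(model,
                                      accelerator='openvino',
                                      calib_data=DataLoader(train_dataset, batch_size=32))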
📝 Note
The InferenceOptimizer.quantize function has a precision parameter that specifies the precision for quantization. It defaults to 'int8', so the precision parameter is omitted here for INT8 quantization.
For INT8 quantization using POT, only static post-training quantization is supported, so calib_data (the calibration data) is always required when accelerator='openvino'.
For calib_data, the batch size is not important, because calibration is intended to read 100 samples. The calibration data does not need to contain labels.
Please refer to the API documentation for more information on InferenceOptimizer.quantize.
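To make the role of calibration data more concrete, the sketch below illustrates static post-training quantization in plain Python. It uses a simplified min/max affine scheme and is not POT's actual algorithm. The helper names (`batched`, `calibrate`, `quantize`, `dequantize`), the sample values, and the `max_samples=100` cap (which mirrors the note above) are all assumptions made for illustration.

```python
from itertools import islice


def batched(samples, batch_size):
    """Yield lists of `batch_size` samples, mimicking a DataLoader (hypothetical helper)."""
    it = iter(samples)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def calibrate(batches, max_samples=100):
    """Observe the value range over at most `max_samples` unlabeled samples and
    derive an unsigned 8-bit scale and zero point (simplified min/max scheme)."""
    lo, hi, seen = float("inf"), float("-inf"), 0
    for batch in batches:
        for value in batch:
            if seen >= max_samples:
                break
            lo, hi, seen = min(lo, value), max(hi, value), seen + 1
        if seen >= max_samples:
            break
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # the range must contain zero
    scale = (hi - lo) / 255
    if scale == 0.0:
        scale = 1.0  # degenerate range: avoid division by zero
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point):
    """Map a float to an integer in [0, 255]."""
    return max(0, min(255, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    """Map the integer back to an approximate float."""
    return (q - zero_point) * scale


# Made-up "activations" between -2.0 and 1.96 (illustrative only, no labels).
samples = [((i * 37) % 100) / 25 - 2.0 for i in range(500)]

# The same 100 samples are read whether they arrive one by one or in batches of 32.
params_bs1 = calibrate(batched(samples, 1))
params_bs32 = calibrate(batched(samples, 32))
assert params_bs1 == params_bs32

scale, zero_point = params_bs32
x = 0.5
x_hat = dequantize(quantize(x, scale, zero_point), scale, zero_point)
print(f"scale={scale:.5f} zero_point={zero_point} x={x} x_hat={x_hat:.4f}")
```

In this sketch the scale and zero point are fixed once, ahead of inference, from the calibration samples alone, which is why no labels are needed. With Nano, the corresponding work is handled inside InferenceOptimizer.quantize.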
You can then run the normal inference steps with the quantized model under the context manager provided by Nano:
import torch

with InferenceOptimizer.get_context(q_model):
    # stack two validation images into one batch
    x = torch.stack([val_dataset[0][0], val_dataset[1][0]])
    # use the quantized model here
    y_hat = q_model(x)
    predictions = y_hat.argmax(dim=1)
    print(predictions)
📝 Note
For all Nano models optimized by InferenceOptimizer.quantize, you need to wrap the inference steps with the automatic context manager InferenceOptimizer.get_context(model=...) provided by Nano. Refer to the Nano documentation for more detailed usage of the context manager.
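Because INT8 quantization is lossy, one common sanity check is to compare the quantized model's top-1 predictions with those of the original model on a few validation samples. The sketch below shows that comparison in plain Python on made-up logits. The helper names `argmax` and `top1_agreement` and all numbers are hypothetical; in practice the rows would come from the outputs of `model(x)` and `q_model(x)`.

```python
def argmax(row):
    """Index of the largest value in a list of logits (first index on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def top1_agreement(logits_a, logits_b):
    """Fraction of samples whose predicted class matches between two models."""
    if len(logits_a) != len(logits_b):
        raise ValueError("both models must be evaluated on the same samples")
    if not logits_a:
        raise ValueError("no samples to compare")
    matches = sum(argmax(a) == argmax(b) for a, b in zip(logits_a, logits_b))
    return matches / len(logits_a)


# Made-up logits for 4 samples and 3 classes (illustrative only).
fp32_logits = [[2.1, 0.3, -1.0], [0.2, 1.7, 0.1], [-0.5, 0.4, 0.9], [1.2, 1.1, 0.0]]
int8_logits = [[2.0, 0.4, -0.9], [0.1, 1.8, 0.2], [-0.4, 0.5, 0.8], [1.0, 1.2, 0.1]]

agreement = top1_agreement(fp32_logits, int8_logits)
print(f"top-1 agreement: {agreement:.2f}")  # prints "top-1 agreement: 0.75"
```

Here the last sample flips class because its two largest logits are close together, which is the typical way small quantization errors show up in predictions.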