- Working with Mixed Precision
- Support: FP32, FP16, and INT8.
- TensorRT is free to choose a higher-precision kernel if it is faster or if no kernel exists for the specified precision.
- To restrict the types, use kSTRICT_TYPES. TensorRT may still use a higher-precision kernel if no kernel exists for the specified precision.
- C++ API
- Support:
- Per-layer computation and output precision setting (see the sketch after this list).
- Heterogeneous precision for different parts of the calculation.
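A minimal sketch of the C++ API calls involved, assuming a network and builder config already exist (layer index 0 is only an example):

```cpp
#include "NvInfer.h"

// Hypothetical helper: enable FP16, ask TensorRT to respect the
// requested types, and pin one layer to FP16 computation/output.
void configurePrecision(nvinfer1::INetworkDefinition* network,
                        nvinfer1::IBuilderConfig* config)
{
    // Allow FP16 kernels globally.
    config->setFlag(nvinfer1::BuilderFlag::kFP16);
    // Obey the requested layer types instead of always picking the
    // fastest kernel (higher-precision fallback can still happen if
    // no kernel exists for the requested precision).
    config->setFlag(nvinfer1::BuilderFlag::kSTRICT_TYPES);

    // Per-layer computation and output precision (layer 0 as an example).
    nvinfer1::ILayer* layer = network->getLayer(0);
    layer->setPrecision(nvinfer1::DataType::kHALF);
    layer->setOutputType(0, nvinfer1::DataType::kHALF);
}
```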
- TF32 Inference.
- Internal: FP32 data; multiplication with a reduced (FP16-level) mantissa; FP32 accumulation.
- Usage: Enabled by default; there is no way to force it.
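On TensorRT versions that expose BuilderFlag::kTF32 (7.1 and later), the flag is set by default and can only be cleared to opt out; a minimal sketch, assuming such a version:

```cpp
#include "NvInfer.h"

// TF32 is enabled by default; clearing the flag disables it.
// There is no flag that forces TF32 kernels to be selected.
void disableTf32(nvinfer1::IBuilderConfig* config)
{
    config->clearFlag(nvinfer1::BuilderFlag::kTF32);
}
```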
- FP16 Inference.
- Internal: Weights can be FP32 or FP16.
- Usage: Set BuilderFlag::kFP16.
- INT8 Inference.
- Internal:
- Symmetric quantization.
- FP32 activation tensors and weights need to be quantized.
- The dynamic range determines the quantization scale.
- Usage:
- Dynamic Range (two options; see the sketch after this block):
- Manually set the dynamic range for each tensor, including network inputs/outputs, with ITensor::setDynamicRange.
- Use INT8 calibration to generate the per-tensor dynamic range from a calibration dataset.
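A minimal sketch of the manual route; the ranges below are placeholders, and real values would come from quantization-aware training or offline analysis:

```cpp
#include "NvInfer.h"

// Hypothetical sketch: enable INT8 and set the dynamic range of every
// network tensor by hand.
void setManualRanges(nvinfer1::INetworkDefinition* network,
                     nvinfer1::IBuilderConfig* config)
{
    config->setFlag(nvinfer1::BuilderFlag::kINT8);

    // Network inputs.
    for (int i = 0; i < network->getNbInputs(); ++i)
        network->getInput(i)->setDynamicRange(-127.0f, 127.0f);

    // Every layer output (placeholder range).
    for (int i = 0; i < network->getNbLayers(); ++i)
    {
        nvinfer1::ILayer* layer = network->getLayer(i);
        for (int j = 0; j < layer->getNbOutputs(); ++j)
            layer->getOutput(j)->setDynamicRange(-127.0f, 127.0f);
    }
}
```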
- Calibration:
- Calibrators:
- IInt8EntropyCalibrator2 (CNNs), IInt8MinMaxCalibrator (BERT), IInt8EntropyCalibrator, IInt8LegacyCalibrator.
- Steps:
- Build an FP32 engine, run it on the calibration set, and record histograms of the activation tensors.
- Build a calibration table.
- Build the INT8 engine from the calibration table and the network definition.
- Optimization:
- The calibration table can be cached for multiple builds; however, it is the user's responsibility to invalidate the cache when it becomes stale.
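A calibrator skeleton, assuming TensorRT 8.x method signatures (earlier versions omit noexcept); the batch-loading logic and the cache file name are placeholders:

```cpp
#include "NvInfer.h"
#include <cstddef>
#include <fstream>
#include <iterator>
#include <vector>

// Hypothetical calibrator skeleton; data loading is left as a stub.
class MyEntropyCalibrator : public nvinfer1::IInt8EntropyCalibrator2
{
public:
    MyEntropyCalibrator(int batchSize, void* deviceInput)
        : mBatchSize(batchSize), mDeviceInput(deviceInput) {}

    int32_t getBatchSize() const noexcept override { return mBatchSize; }

    bool getBatch(void* bindings[], char const* names[], int32_t nbBindings) noexcept override
    {
        // Copy the next calibration batch to device memory here;
        // return false once the calibration set is exhausted.
        if (!loadNextBatchToDevice()) return false;
        bindings[0] = mDeviceInput;
        return true;
    }

    void const* readCalibrationCache(std::size_t& length) noexcept override
    {
        // Reuse a cached calibration table if one exists.
        mCache.clear();
        std::ifstream in("calibration.cache", std::ios::binary);
        if (in)
            mCache.assign(std::istreambuf_iterator<char>(in),
                          std::istreambuf_iterator<char>());
        length = mCache.size();
        return mCache.empty() ? nullptr : mCache.data();
    }

    void writeCalibrationCache(void const* cache, std::size_t length) noexcept override
    {
        std::ofstream out("calibration.cache", std::ios::binary);
        out.write(static_cast<char const*>(cache), length);
    }

private:
    bool loadNextBatchToDevice() { return false; }  // stub

    int mBatchSize;
    void* mDeviceInput;
    std::vector<char> mCache;
};
```

The calibrator is then attached before building the engine with config->setFlag(BuilderFlag::kINT8) and config->setInt8Calibrator(&calibrator).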
- Explicit Precision
- Definition: the precision of all layers and tensors in the network can be inferred. This enables importing quantized models with explicit quantizing and dequantizing scale layers (QDQ scale layers) into TensorRT.
- Use: weights and biases are expected in FP32. Quantized layers should have QDQ scale layers inserted at their inputs/outputs.
- Activation tensors and weights are quantized and dequantized using scale parameters computed with quantization-aware training.
- Support: FP32, FP16, and INT8.
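A sketch of creating the network in explicit-precision mode, assuming a TensorRT version (e.g. 6.x/7.x) that still exposes NetworkDefinitionCreationFlag::kEXPLICIT_PRECISION:

```cpp
#include "NvInfer.h"

// Hypothetical sketch: create a network in explicit-precision mode so
// that a quantized model with QDQ scale layers can be imported.
nvinfer1::INetworkDefinition* createExplicitPrecisionNetwork(nvinfer1::IBuilder* builder)
{
    uint32_t const flags =
        (1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH)) |
        (1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_PRECISION));
    return builder->createNetworkV2(flags);
}
```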
- Working with Reformat-Free Network I/O Tensors:
- Reason:
- ASIL safety requirements mandate that access to the GPU address space be removed from the safety path.
- TensorRT < 6.0.1 assumes that I/O tensors are FP32.
- Usage:
- Set the allowed I/O tensor formats through a bitmask (see the sketch after this block).
- If no reformat-free path can be implemented, TensorRT emits a warning message.
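A minimal sketch of restricting an input tensor to a reformat-free format/type combination; TensorFormat::kLINEAR and FP16 are chosen here only as examples:

```cpp
#include "NvInfer.h"

// Hypothetical sketch: constrain the first network input to a single
// allowed format via a bitmask of TensorFormat flags.
void setReformatFreeInput(nvinfer1::INetworkDefinition* network)
{
    nvinfer1::ITensor* input = network->getInput(0);
    // Allow only the linear (row-major) layout for this tensor.
    uint32_t const formats =
        1U << static_cast<uint32_t>(nvinfer1::TensorFormat::kLINEAR);
    input->setAllowedFormats(formats);
    input->setType(nvinfer1::DataType::kHALF);  // e.g. FP16 I/O
}
```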
- Support:
- Set up calibration:
- Provide the calibration data in FP32, within the range [-128.0f, 127.0f].
- Restrictions with DLA (Deep Learning Accelerator):
- TensorRT always inserts reformat layers at DLA boundaries because of differing memory alignment requirements.