
  1. Working with Mixed Precision
    1. Supported precisions: FP32, TF32, FP16, and INT8.
      1. TensorRT is free to choose a higher-precision kernel if it is faster or if no kernel exists for the requested precision.
      2. To restrict type choices, set the kSTRICT_TYPES builder flag. Even then, TensorRT may still fall back to a higher-precision kernel when no kernel exists for the requested precision (see the sketch below).
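A minimal sketch of restricting type choices, assuming config is an existing nvinfer1::IBuilderConfig* obtained from the builder:

```cpp
// Minimal sketch, assuming `config` is an nvinfer1::IBuilderConfig* created
// via builder->createBuilderConfig(). kSTRICT_TYPES asks TensorRT to obey the
// precisions set on layers, falling back only when no such kernel exists.
config->setFlag(nvinfer1::BuilderFlag::kSTRICT_TYPES);
```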
    2. C++ API
      1. Supports the features listed below:
      2. Per-layer computation and output precision can be set (see the sketch below).
        1. This allows heterogeneous precision across different parts of the network.
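A minimal sketch of per-layer precision control, assuming network is an existing nvinfer1::INetworkDefinition* and that its first layer should run in FP16:

```cpp
// Minimal sketch, assuming `network` is an nvinfer1::INetworkDefinition*.
// Request that one layer compute in FP16 and emit an FP16 output, while the
// rest of the network keeps its default precision.
nvinfer1::ILayer* layer = network->getLayer(0);
layer->setPrecision(nvinfer1::DataType::kHALF);      // computation precision
layer->setOutputType(0, nvinfer1::DataType::kHALF);  // precision of output 0
```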
      3.  TF32 Inference.
        1. Internal: data is stored in FP32; Tensor Cores multiply with an FP16-width mantissa (but FP32 range) and accumulate in FP32.
        2. Usage: enabled by default (BuilderFlag::kTF32). It can be disabled, but there is no way to force TensorRT to use TF32 kernels (see the sketch below).
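A minimal sketch of turning TF32 off, assuming a TensorRT version (7.1 or later) in which the kTF32 builder flag exists and config is an nvinfer1::IBuilderConfig*:

```cpp
// Minimal sketch, assuming TensorRT >= 7.1 and an existing
// nvinfer1::IBuilderConfig* `config`. TF32 is on by default; clearing the
// flag disables it so FP32 layers use ordinary FP32 math.
config->clearFlag(nvinfer1::BuilderFlag::kTF32);
```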
      4. FP16 Inference.
        1. Internal: Weights can be FP32 or FP16.
        2. Usage: set the BuilderFlag::kFP16 builder flag (see the sketch below).
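A minimal sketch of enabling FP16, again assuming config is an nvinfer1::IBuilderConfig*:

```cpp
// Minimal sketch, assuming `config` is an nvinfer1::IBuilderConfig*.
// Allow TensorRT to select FP16 kernels; weights supplied in FP32 are
// converted to FP16 where needed.
config->setFlag(nvinfer1::BuilderFlag::kFP16);
```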
      5. INT8 Inference.
        1. Internal:
          1. Symmetric quantization.
          2. FP32 activation tensors and weights need to be quantized.
          3. The dynamic range determines the quantization scale.
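A hypothetical worked example of how a symmetric quantization scale follows from the dynamic range; the function below is illustrative only, not TensorRT API:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical illustration (not TensorRT API): with symmetric quantization,
// the scale is the dynamic range divided by 127, and values are clamped to
// the representable range before rounding to INT8.
float quantizeDequantize(float x, float dynamicRange)
{
    const float scale = dynamicRange / 127.0f;
    const float clamped = std::max(-dynamicRange, std::min(dynamicRange, x));
    const int8_t q = static_cast<int8_t>(std::lround(clamped / scale));
    return q * scale;  // the dequantized approximation of x
}
```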
        2. Usage:
          1. Dynamic Range
            1. Manually set the dynamic range for each tensor.
              1. Call ITensor::setDynamicRange(min, max) on each network tensor, including inputs and outputs (see the sketch below).
            2. Use INT8 calibration to generate a per-tensor dynamic range from a calibration dataset.
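A minimal sketch of setting dynamic ranges manually, assuming network is an nvinfer1::INetworkDefinition* and that ranges is a hypothetical std::map from tensor name to the maximum absolute value observed in an FP32 run:

```cpp
// Minimal sketch. `ranges` is a hypothetical std::map<std::string, float>
// holding the max absolute value seen for each tensor in an FP32 run.
for (int i = 0; i < network->getNbInputs(); ++i)
{
    nvinfer1::ITensor* t = network->getInput(i);
    const float r = ranges.at(t->getName());
    t->setDynamicRange(-r, r);  // symmetric range
}
for (int i = 0; i < network->getNbLayers(); ++i)
{
    nvinfer1::ILayer* layer = network->getLayer(i);
    for (int j = 0; j < layer->getNbOutputs(); ++j)
    {
        nvinfer1::ITensor* t = layer->getOutput(j);
        const float r = ranges.at(t->getName());
        t->setDynamicRange(-r, r);
    }
}
```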
          2.  Calibration:
            1. Calibrator:
              1. IInt8EntropyCalibrator2 (recommended for CNNs), IInt8MinMaxCalibrator (recommended for NLP networks such as BERT), IInt8EntropyCalibrator, IInt8LegacyCalibrator
            2. Steps:
              1. Build an FP32 engine, run it on the calibration set, and record histograms of the activation tensors.
              2. Build a calibration table from the recorded histograms.
              3. Build the INT8 engine from the calibration table and the network definition.
            3. Optimization:
              1. The calibration table can be cached and reused across multiple builds; however, it is the user's responsibility to invalidate the cache when the model or calibration data changes (see the calibrator sketch below).
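A minimal sketch of an entropy calibrator with a file-backed cache. It assumes TensorRT 7+ method signatures, a single input binding, and that each calibration batch is already resident in the device buffer passed to the constructor; the class name and cache file name are illustrative.

```cpp
#include "NvInfer.h"
#include <fstream>
#include <iterator>
#include <vector>

// Minimal sketch of an INT8 entropy calibrator with a file-backed cache.
class EntropyCalibrator : public nvinfer1::IInt8EntropyCalibrator2
{
public:
    EntropyCalibrator(int batchSize, void* deviceInput, int nbBatches)
        : mBatchSize(batchSize), mDeviceInput(deviceInput), mNbBatches(nbBatches) {}

    int getBatchSize() const noexcept override { return mBatchSize; }

    bool getBatch(void* bindings[], const char* names[], int nbBindings) noexcept override
    {
        if (mCurrentBatch >= mNbBatches)
            return false;               // no more calibration data
        bindings[0] = mDeviceInput;     // single input binding assumed
        ++mCurrentBatch;
        return true;
    }

    // Reuse a previously written calibration table if one exists on disk.
    const void* readCalibrationCache(size_t& length) noexcept override
    {
        mCache.clear();
        std::ifstream f("calibration.cache", std::ios::binary);
        if (f)
            mCache.assign(std::istreambuf_iterator<char>(f), std::istreambuf_iterator<char>());
        length = mCache.size();
        return mCache.empty() ? nullptr : mCache.data();
    }

    void writeCalibrationCache(const void* cache, size_t length) noexcept override
    {
        std::ofstream f("calibration.cache", std::ios::binary);
        f.write(static_cast<const char*>(cache), length);
    }

private:
    int mBatchSize;
    void* mDeviceInput;
    int mNbBatches;
    int mCurrentBatch{0};
    std::vector<char> mCache;
};
```

A typical build flow would then set config->setFlag(nvinfer1::BuilderFlag::kINT8) and config->setInt8Calibrator(&calibrator) before building the engine.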
      6. Explicit Precision
        1. Definition: the precision of every layer and tensor in the network can be inferred from the network definition itself rather than selected by TensorRT. This enables importing quantized models that carry explicit quantizing and dequantizing scale layers (QDQ scale layers) into TensorRT.
        2. Use: weights and biases are expected in FP32. Quantized layers should have QDQ scale layers inserted at their inputs and outputs (see the sketch below).
          1. Activation tensors and weights are quantized and dequantized using scale parameters computed during quantization-aware training.
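A minimal sketch of creating an explicit-precision network, assuming a TensorRT version (e.g. 7.x) in which the kEXPLICIT_PRECISION network-creation flag is available and builder is an nvinfer1::IBuilder*:

```cpp
// Minimal sketch, assuming TensorRT 7.x where kEXPLICIT_PRECISION exists
// (later versions handle Q/DQ layers without this flag). `builder` is an
// existing nvinfer1::IBuilder*.
const uint32_t flags =
    (1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH)) |
    (1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_PRECISION));
nvinfer1::INetworkDefinition* network = builder->createNetworkV2(flags);
```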
  2. Working with Reformat-Free Network I/O Tensors:
    1. Reason:
      1. Automotive Safety Integrity Level (ASIL) safety requirements call for removing accesses to the GPU address space from the safety path.
      2. TensorRT versions earlier than 6.0.1 assume that I/O tensors are FP32, so reformat layers could be inserted at the network boundaries.
    2. Usage:
      1. Set the allowed I/O tensor formats through a bitmask of TensorFormat values (see the sketch below).
      2. If no reformat-free implementation is found, TensorRT emits a warning and falls back to the fastest available path, which may include reformatting.
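A minimal sketch of requesting a reformat-free FP16 input, assuming network is an nvinfer1::INetworkDefinition* whose first input should be consumed directly in HWC8 layout:

```cpp
// Minimal sketch, assuming an existing nvinfer1::INetworkDefinition* `network`.
// Fix the type and allowed memory layout of the first input so TensorRT does
// not insert a reformat layer at the network boundary.
nvinfer1::ITensor* input = network->getInput(0);
input->setType(nvinfer1::DataType::kHALF);
input->setAllowedFormats(1U << static_cast<uint32_t>(nvinfer1::TensorFormat::kHWC8));
```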
    3. Support:
      1. https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#reformat-free-support
    4. Set up calibration:
      1. Provide calibration data in FP32, scaled to the range [-128.0f, 127.0f], so it can be converted to INT8 I/O data without precision loss.
    5. Restrictions with DLA (Deep learning accelerator):
      1. TensorRT always inserts reformat layers at DLA boundaries because the DLA has memory-alignment requirements that differ from the GPU's.
