
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html

  1. What is TensorRT
    1. Benefits of TensorRT: High-performance neural network inference optimizer and runtime engine for production deployment.
      1. Combine layers
      2. Optimize kernel selection
      3. Normalization and conversion to optimize matrix math
      4. Lower precision
    2. Where does TensorRT fit
      1. Phase 1: Training (TensorRT is not used)
      2. Phase 2: Developing a deployment solution.
        1. Define the architecture of inference solution
        2. Build an inference engine from saved networks using TensorRT (plan file)
          1. Parsing
          2. Optimization
          3. Validation
          4. Serialization
      3. Phase 3: Deploying a solution
        1. Deserialization
        2. Asynchronous execution
    3. How does TensorRT work
      1. Build platform-specific engines
        1. Optimization includes elimination of layers whose outputs are not used; fusion of operations; aggregation of operations; merging of layers; changing precision; profiling-guided optimization; weight pre-formatting; and memory optimization.
    4. What capabilities does TensorRT provide?
      1. Import, calibrate, generate, deploy optimized neural networks.
      2. Key Interfaces:
        1. Network Definition: Input and output tensors. Supported and customized layers.
        2. Optimization Profile: Impose constraints on dynamic dimensions.
        3. Builder Configuration: Specify details for creating an engine. An interface for 8-bit quantization.
        4. Builder: Create an optimized engine from network definition and builder configuration.
        5. Engine: Synchronous and asynchronous inference execution, profiling, enumeration, and querying. Multiple execution contexts.
        6. Parsers: Caffe parser; UFF parser; ONNX parser.
  2. Using the C++ API
    1. Instantiating TensorRT objects (see the sketch after this list)
      1. Create ICudaEngine from
        1. network definition OR
          1. IBuilder -> INetworkDefinition
          2. INetworkDefinition + IParser -> ICudaEngine
        2. serialized data
      2. Create IRuntime
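A minimal sketch of both instantiation paths, written against TensorRT 7.x-style APIs with the ONNX parser; the file path, plan buffer, and logger verbosity are placeholder choices, and error checking is omitted:

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include "NvInfer.h"
#include "NvOnnxParser.h"

// Every top-level TensorRT object takes a logger.
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) override
    {
        if (severity <= Severity::kWARNING) { std::cout << msg << std::endl; }
    }
} gLogger;

// Path 1: IBuilder -> INetworkDefinition (populated by a parser) -> ICudaEngine.
nvinfer1::ICudaEngine* buildEngineFromOnnx(const char* onnxPath)
{
    using namespace nvinfer1;
    IBuilder* builder = createInferBuilder(gLogger);
    const auto flags = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    INetworkDefinition* network = builder->createNetworkV2(flags);
    nvonnxparser::IParser* parser = nvonnxparser::createParser(*network, gLogger);
    parser->parseFromFile(onnxPath, static_cast<int>(ILogger::Severity::kWARNING));
    IBuilderConfig* config = builder->createBuilderConfig();
    return builder->buildEngineWithConfig(*network, *config);
}

// Path 2: IRuntime -> ICudaEngine from serialized data (a plan file already read into memory).
nvinfer1::ICudaEngine* loadEngineFromPlan(const void* planData, std::size_t planSize)
{
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
    return runtime->deserializeCudaEngine(planData, planSize, nullptr);
}
```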
    2. Create A Network Definition
      1. Create
        1. Import model using TensorRT parser library OR
        2. Define the model directly using TensorRT API
      2. Contain
        1. Input and output tensors with names.
        2. Pointers to model weights (the network does not own the weight memory).
      3. Examples
        1. Create from scratch (see the sketch after this list)
          1. create IBuilder, INetworkDefinition objects
          2. add layers through addXXX methods of INetworkDefinition.
          3. add inputs / outputs through the addInput / markOutput methods of INetworkDefinition at the beginning / end of network construction
        2. Import a model using a parser
          1. create IBuilder, INetworkDefinition objects
          2. create parser object to populate
        3. Detailed examples for Caffe, TensorFlow, and ONNX
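The from-scratch path, sketched for a toy conv + ReLU network; the tensor names, dimensions, and the weight buffers passed in are made-up placeholders, and the builder is assumed to already exist:

```cpp
#include <cstdint>
#include "NvInfer.h"

// The network only stores pointers to the weights; the caller's buffers must stay
// alive until the engine has been built.
nvinfer1::INetworkDefinition* defineToyNetwork(nvinfer1::IBuilder& builder,
                                               nvinfer1::Weights convKernel,
                                               nvinfer1::Weights convBias)
{
    using namespace nvinfer1;
    INetworkDefinition* network = builder.createNetworkV2(
        1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));

    // Inputs are added first and given names.
    ITensor* data = network->addInput("data", DataType::kFLOAT, Dims4{1, 1, 28, 28});

    // Layers are added with the addXXX methods of INetworkDefinition.
    IConvolutionLayer* conv = network->addConvolutionNd(*data, 20, DimsHW{5, 5}, convKernel, convBias);
    IActivationLayer* relu = network->addActivation(*conv->getOutput(0), ActivationType::kRELU);

    // Outputs are named and marked at the end.
    relu->getOutput(0)->setName("features");
    network->markOutput(*relu->getOutput(0));
    return network;
}
```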
    3. Building An Engine in C++
      1. What is an engine: an optimized runtime representation of the network. Optimization is profiling-based, so build the engine on the same GPU that will be used in production.
        1. Important properties:
          1. maximum batch size: the batch size TensorRT optimizes for; at runtime, any batch size up to this value may be chosen.
          2. maximum workspace size: limits the maximum amount of temporary GPU memory that any layer in the network can use.
      2. How to build the engine:
        1. IBuilder -> IBuilderConfig -> ICudaEngine (TensorRT makes copies of weights)
        2. Destroy the builder, config, network, and parser when done, in reverse order of creation (see the sketch after this list).
      3. How to speed up the build process
        1. Build a layer timing cache -> Reuse profiling result.
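A sketch of the build step, assuming the builder, network, and parser were created as in the earlier sketch; the 1 GiB workspace and batch size 8 are arbitrary example values:

```cpp
#include "NvInfer.h"
#include "NvOnnxParser.h"

nvinfer1::ICudaEngine* buildEngine(nvinfer1::IBuilder* builder,
                                   nvinfer1::INetworkDefinition* network,
                                   nvonnxparser::IParser* parser)
{
    using namespace nvinfer1;
    IBuilderConfig* config = builder->createBuilderConfig();

    // Upper bound on the temporary GPU scratch space any single layer may use.
    config->setMaxWorkspaceSize(1ULL << 30); // 1 GiB, example value

    // Batch size the engine is optimized for (only relevant for implicit-batch networks).
    builder->setMaxBatchSize(8); // example value

    // The builder copies the weights, so the network and parser can be released afterwards.
    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);

    // Release everything in reverse order of creation.
    config->destroy();
    parser->destroy();
    network->destroy();
    builder->destroy();
    return engine;
}
```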
    4. Serializing a model
      1. ICudaEngine -> IHostMemory
        1. Use the serialize method of ICudaEngine to create an IHostMemory.
      2. IHostMemory -> ICudaEngine
        1. Use the deserializeCudaEngine method of IRuntime (see the sketch below).
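A sketch of round-tripping an engine through a plan file; the file path is a placeholder and error checking is omitted:

```cpp
#include <fstream>
#include <iterator>
#include <vector>
#include "NvInfer.h"

// ICudaEngine -> IHostMemory -> file on disk.
void savePlan(nvinfer1::ICudaEngine* engine, const char* path)
{
    nvinfer1::IHostMemory* serialized = engine->serialize();
    std::ofstream out(path, std::ios::binary);
    out.write(static_cast<const char*>(serialized->data()),
              static_cast<std::streamsize>(serialized->size()));
    serialized->destroy();
}

// file on disk -> IRuntime::deserializeCudaEngine -> ICudaEngine.
nvinfer1::ICudaEngine* loadPlan(nvinfer1::IRuntime* runtime, const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
    return runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
}
```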
    5. Perform Inference
      1. Create an IExecutionContext for each inference task
      2. Get input & output index through the blob names
      3. Set up buffer arrays to the input & output buffers
      4. Enqueue the buffers on a CUDA stream for asynchronous execution (see the sketch below)
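A sketch of one inference call for an explicit-batch engine with exactly one input and one output binding; the binding names and byte sizes are placeholders and CUDA error checks are omitted:

```cpp
#include <cstddef>
#include <cuda_runtime_api.h>
#include "NvInfer.h"

void infer(nvinfer1::ICudaEngine* engine,
           const float* hostInput, float* hostOutput,
           std::size_t inputBytes, std::size_t outputBytes)
{
    // One execution context per concurrent inference task.
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();

    // Look up binding slots by the tensor names used in the network definition.
    const int inputIndex = engine->getBindingIndex("data");      // placeholder name
    const int outputIndex = engine->getBindingIndex("features"); // placeholder name

    // Buffer array ordered by binding index (assumes exactly two bindings).
    void* buffers[2]{nullptr, nullptr};
    cudaMalloc(&buffers[inputIndex], inputBytes);
    cudaMalloc(&buffers[outputIndex], outputBytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Copy in, enqueue the engine, copy out: all asynchronous on the same stream.
    cudaMemcpyAsync(buffers[inputIndex], hostInput, inputBytes, cudaMemcpyHostToDevice, stream);
    context->enqueueV2(buffers, stream, nullptr);
    cudaMemcpyAsync(hostOutput, buffers[outputIndex], outputBytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(buffers[inputIndex]);
    cudaFree(buffers[outputIndex]);
    context->destroy();
}
```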
    6. Memory Management
      1. Default: TensorRT manages the device memory of the IExecutionContext
      2. Customization:
        1. use the setDeviceMemory method (see the sketch after this list)
        2. implement IGpuAllocator interface
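A sketch of the setDeviceMemory path, where the application owns the per-context scratch allocation (the alternative, a custom IGpuAllocator passed to the builder/runtime, is not shown):

```cpp
#include <cstddef>
#include <cuda_runtime_api.h>
#include "NvInfer.h"

nvinfer1::IExecutionContext* createContextWithUserMemory(nvinfer1::ICudaEngine* engine,
                                                         void** outScratch)
{
    // Ask TensorRT not to allocate the per-context device memory itself.
    nvinfer1::IExecutionContext* context = engine->createExecutionContextWithoutDeviceMemory();

    // The engine reports how much device memory a context needs; the caller owns this allocation.
    const std::size_t scratchBytes = engine->getDeviceMemorySize();
    cudaMalloc(outScratch, scratchBytes);
    context->setDeviceMemory(*outScratch);

    // The allocation must stay valid, and must not be shared by contexts that run
    // concurrently, for as long as this context is used.
    return context;
}
```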
    7. Refitting An Engine
      1. Definition: replace the weights of an engine without rebuilding it. Supplying new weights does not by itself perform the refit; the engine is only updated in the final step below.
      2. Process:
        1. Request refittable engine.
        2. Create refittable object.
        3. Update weights.
        4. Find out the required weights. Supply them.
        5. Update the engine with all the weights provided.
        6. Destroy the refitter (see the sketch after this list).
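A sketch of that process with IRefitter; the layer name, weights role, and replacement weights are placeholders, and the engine is assumed to have been built with BuilderFlag::kREFIT:

```cpp
#include <cstdint>
#include <vector>
#include "NvInfer.h"

bool refitEngine(nvinfer1::ICudaEngine* engine, nvinfer1::ILogger& logger,
                 nvinfer1::Weights newConv1Kernel)
{
    using namespace nvinfer1;
    // Step 2: create the refitter for a refittable engine.
    IRefitter* refitter = createInferRefitter(*engine, logger);

    // Step 3: update the weights we want to change ("conv1" is a placeholder layer name).
    refitter->setWeights("conv1", WeightsRole::kKERNEL, newConv1Kernel);

    // Step 4: ask which other weights must also be supplied, then supply them.
    const int32_t nbMissing = refitter->getMissing(0, nullptr, nullptr);
    std::vector<const char*> layerNames(nbMissing);
    std::vector<WeightsRole> roles(nbMissing);
    refitter->getMissing(nbMissing, layerNames.data(), roles.data());
    // ... call setWeights() for each (layerNames[i], roles[i]) pair here ...

    // Step 5: apply all provided weights to the engine in one step.
    const bool success = refitter->refitCudaEngine();

    // Step 6: destroy the refitter.
    refitter->destroy();
    return success;
}
```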
    8. Algorithm Selection
      1. Default: TensorRT chooses algorithms that globally minimize the execution time of the engine.
      2. Customization:
        1. implement the IAlgorithmSelector interface.
          1. IAlgorithmSelector::selectAlgorithms: return the list of algorithms the builder is allowed to choose from.
          2. IAlgorithmSelector::reportAlgorithms: report the final choices made by the builder.
      3. Can be used to achieve determinism and reproducibility of engine builds (see the sketch below).
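A sketch of a trivial selector that lets the builder consider every candidate algorithm; the method signatures follow the TensorRT 7.2-era IAlgorithmSelector interface and may differ in other versions:

```cpp
#include <cstdint>
#include "NvInfer.h"

// Accept-everything selector; a real one would filter `choices` (e.g. by tactic or
// I/O format) and record the reported picks to replay them for a deterministic build.
class AcceptAllSelector : public nvinfer1::IAlgorithmSelector
{
public:
    // Write the indices of the allowed candidates into `selection` and return how many.
    int32_t selectAlgorithms(const nvinfer1::IAlgorithmContext& context,
                             const nvinfer1::IAlgorithm* const* choices,
                             int32_t nbChoices, int32_t* selection) override
    {
        for (int32_t i = 0; i < nbChoices; ++i) { selection[i] = i; }
        return nbChoices;
    }

    // Called with the algorithms the builder finally chose; record them here if needed.
    void reportAlgorithms(const nvinfer1::IAlgorithmContext* const* contexts,
                          const nvinfer1::IAlgorithm* const* choices,
                          int32_t nbAlgorithms) override
    {
    }
};

// Usage: attach the selector to the builder configuration before building the engine.
// AcceptAllSelector selector;
// config->setAlgorithmSelector(&selector);
```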
  3. Using the Python API
  4. Extending TensorRT With Custom Layers
