https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html
- What is TensorRT
- Benefits of TensorRT: High-performance neural network inference optimizer and runtime engine for production deployment.
- Combine layers
- Optimize kernel selection
- Normalization and conversion to optimize matrix math
- Lower precision
- Where does TensorRT fit
- Phase 1: Training (TensorRT is not used)
- Phase 2: Developing a deployment solution.
- Define the architecture of the inference solution
- Build an inference engine (plan file) from the saved network using TensorRT
- Parsing
- Optimization
- Validation
- Serialization
- Phase 3: Deploying a solution
- Deserialization
- Asynchronous execution
- How does TensorRT work
- Build platform-specific engines
- Optimization includes elimination of unused operations, fusion and aggregation of operations, layer merging, precision changes, profile-guided kernel selection, weight pre-formatting, and memory optimization.
- What capabilities does TensorRT provide?
- Import, calibrate, generate, deploy optimized neural networks.
- Key Interfaces:
- Network Definition: Input and output tensors. Supported and customized layers.
- Optimization Profile: Impose constraints on dynamic dimensions.
- Builder Configuration: Specify details for creating an engine. An interface for 8-bit quantization.
- Builder: Create an optimized engine from network definition and builder configuration.
- Engine: Synchronous and asynchronous inference execution, profiling, enumeration, and querying. Multiple execution contexts.
- Parsers: Caffe parser; UFF parser; ONNX parser.
- Using the C++ API
- Instantiating TensorRT objects
- Create ICudaEngine from
- network definition OR
- IBuilder -> INetworkDefinition
- INetworkDefinition + IParser -> ICudaEngine
- serialized data
- Create IRuntime, then deserialize the plan into an ICudaEngine (see the sketch below)
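For concreteness, a minimal sketch of both creation paths, assuming ONNX input; the logger class, the model.onnx and model.plan file names are illustrative, not from the guide:

```cpp
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

// Every TensorRT entry point takes an ILogger; a minimal one is enough for a sketch.
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
} gLogger;

int main()
{
    // Path 1: IBuilder -> INetworkDefinition (+ IParser), ready for the build step later on
    auto builder = nvinfer1::createInferBuilder(gLogger);
    auto network = builder->createNetworkV2(
        1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));
    auto parser = nvonnxparser::createParser(*network, gLogger);
    parser->parseFromFile("model.onnx", static_cast<int>(nvinfer1::ILogger::Severity::kWARNING));

    // Path 2: IRuntime + serialized plan -> ICudaEngine
    std::ifstream planFile("model.plan", std::ios::binary);
    std::vector<char> plan((std::istreambuf_iterator<char>(planFile)), std::istreambuf_iterator<char>());
    auto runtime = nvinfer1::createInferRuntime(gLogger);
    auto engine = runtime->deserializeCudaEngine(plan.data(), plan.size());
    return engine != nullptr ? 0 : 1;
}
```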
- Create A Network Definition
- Create
- Import model using TensorRT parser library OR
- Define the model directly using TensorRT API
- Contains:
- Input and output tensors with names.
- Pointers to model weights (the network does not own this memory; keep the weights alive until the engine is built).
- Examples
- Create from scratch
- create IBuilder, INetworkDefinition objects
- add layers through addXXX methods of INetworkDefinition.
- add inputs / outputs through the addInput / markOutput methods of INetworkDefinition at the beginning / end (see the sketch after this list)
- Import a model using a parser
- create IBuilder, INetworkDefinition objects
- create a parser object to populate the network definition
- Detailed examples are given for the Caffe, TensorFlow, and ONNX parsers
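A sketch of the from-scratch path, reusing the gLogger from the earlier sketch; the layer, dimensions, weight values, and tensor names are made up for illustration:

```cpp
auto builder = nvinfer1::createInferBuilder(gLogger);
auto network = builder->createNetworkV2(
    1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));

// declare the input tensor by name, type, and dimensions
auto input = network->addInput("input", nvinfer1::DataType::kFLOAT, nvinfer1::Dims4{1, 3, 224, 224});

// the network only stores pointers to weights, so these buffers must outlive engine building
std::vector<float> kernel(32 * 3 * 3 * 3, 0.01f);
std::vector<float> bias(32, 0.0f);
nvinfer1::Weights kw{nvinfer1::DataType::kFLOAT, kernel.data(), static_cast<int64_t>(kernel.size())};
nvinfer1::Weights bw{nvinfer1::DataType::kFLOAT, bias.data(), static_cast<int64_t>(bias.size())};

// add layers through the addXXX methods
auto conv = network->addConvolutionNd(*input, 32, nvinfer1::DimsHW{3, 3}, kw, bw);
conv->setName("conv1");
auto relu = network->addActivation(*conv->getOutput(0), nvinfer1::ActivationType::kRELU);

// name and mark the output tensor at the end
relu->getOutput(0)->setName("output");
network->markOutput(*relu->getOutput(0));
```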
- Building An Engine in C++
- What is an engine: an optimized runtime representation of the network. Optimization is profiling-based, so use the same GPU for building and production.
- Important properties:
- maximum batch size: the batch size TensorRT optimizes for; at runtime, a smaller batch size may be used.
- maximum workspace size: the temporary workspace available during build; it limits the maximum scratch memory any layer in the network may use.
- How to build the engine:
- IBuilder -> IBuilderConfig -> ICudaEngine (TensorRT makes its own copies of the weights); see the sketch below
- Destroy the builder, config, network, and parser afterwards (in reverse order of creation).
- How to speed up the build process
- Build a layer timing cache -> reuse profiling results across builds.
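A sketch of the build step, continuing from the network definition above; the workspace size and batch size are arbitrary example values:

```cpp
auto config = builder->createBuilderConfig();
config->setMaxWorkspaceSize(1ULL << 30);   // 1 GiB of scratch space available to any single layer
builder->setMaxBatchSize(16);              // only meaningful for implicit-batch networks

// the builder profiles kernels on the current GPU and copies the weights into the engine
nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);

// once the engine exists, destroy the intermediate objects in reverse order of creation
config->destroy();
network->destroy();
builder->destroy();
```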
- Serializing a model
- ICudaEngine -> IHostMemory
- Use the serialize method of ICudaEngine to create an IHostMemory.
- IHostMemory -> ICudaEngine
- Use the deserializeCudaEngine method of IRuntime (see the sketch below).
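A sketch of the round trip, assuming the engine and gLogger from the earlier sketches plus the usual &lt;fstream&gt;/&lt;vector&gt; includes; model.plan is a placeholder file name:

```cpp
// serialize: ICudaEngine -> IHostMemory -> file on disk
nvinfer1::IHostMemory* plan = engine->serialize();
std::ofstream out("model.plan", std::ios::binary);
out.write(static_cast<const char*>(plan->data()), plan->size());
plan->destroy();

// deserialize: file -> IRuntime::deserializeCudaEngine -> ICudaEngine
std::ifstream in("model.plan", std::ios::binary);
std::vector<char> blob((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
auto runtime = nvinfer1::createInferRuntime(gLogger);
auto restored = runtime->deserializeCudaEngine(blob.data(), blob.size());
```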
- Perform Inference
- Create an IExecutionContext for each concurrent inference task
- Get the input & output binding indices from the tensor (blob) names
- Set up an array of pointers to the input & output device buffers
- Enqueue the work for asynchronous execution (see the sketch below)
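A sketch of one asynchronous inference, assuming the CUDA runtime header (cuda_runtime_api.h) and the tiny network defined earlier, whose tensors are named "input" and "output" (so the 3x3 valid convolution output is 1x32x222x222); the buffer sizes are derived from those made-up dimensions:

```cpp
// one execution context per concurrent inference task
nvinfer1::IExecutionContext* context = engine->createExecutionContext();

// binding slots are looked up by the tensor names used in the network definition
const int inputIndex  = engine->getBindingIndex("input");
const int outputIndex = engine->getBindingIndex("output");

const size_t inputBytes  = 1 * 3 * 224 * 224 * sizeof(float);
const size_t outputBytes = 1 * 32 * 222 * 222 * sizeof(float);
std::vector<float> hostInput(inputBytes / sizeof(float));
std::vector<float> hostOutput(outputBytes / sizeof(float));

// device buffers, ordered by binding index
void* buffers[2];
cudaMalloc(&buffers[inputIndex], inputBytes);
cudaMalloc(&buffers[outputIndex], outputBytes);

cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(buffers[inputIndex], hostInput.data(), inputBytes, cudaMemcpyHostToDevice, stream);
context->enqueueV2(buffers, stream, nullptr);   // kick off inference asynchronously
cudaMemcpyAsync(hostOutput.data(), buffers[outputIndex], outputBytes, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);                  // wait for the results
```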
- Memory Management
- Default: TensorRT manages the device memory needed by an IExecutionContext
- Customization:
- use the setDeviceMemory method (see the sketch below)
- implement IGpuAllocator interface
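A sketch of the setDeviceMemory path, assuming an already-built engine; the caller keeps ownership of the scratch buffer:

```cpp
// create the context without letting TensorRT allocate its scratch memory
auto context = engine->createExecutionContextWithoutDeviceMemory();

// the engine reports how much scratch memory a context needs; the caller owns the buffer
void* scratch = nullptr;
cudaMalloc(&scratch, engine->getDeviceMemorySize());
context->setDeviceMemory(scratch);

// ... run inference as usual, then free the buffer once the context no longer needs it
cudaFree(scratch);
```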
- Refitting An Engine
- Definition: replace an engine's weights without rebuilding it; supplying new weights does not by itself perform the refit (the engine is only updated in the final step).
- Process:
- Request refittable engine.
- Create refittable object.
- Update weights.
- Find out the required weights. Supply them.
- Update the engine with all the weights provided (see the sketch below).
- Destroy the refitter.
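A sketch of the refit flow, assuming the engine was built with BuilderFlag::kREFIT; the layer name "conv1" and the replacement weight buffer reuse the made-up example network from above:

```cpp
nvinfer1::IRefitter* refitter = nvinfer1::createInferRefitter(*engine, gLogger);

// supply new weights for a known layer/role pair
std::vector<float> newKernel(32 * 3 * 3 * 3, 0.02f);   // same shape as the original conv kernel
nvinfer1::Weights kw{nvinfer1::DataType::kFLOAT, newKernel.data(), static_cast<int64_t>(newKernel.size())};
refitter->setWeights("conv1", nvinfer1::WeightsRole::kKERNEL, kw);

// ask which other weights must still be supplied before refitting can succeed
int32_t n = refitter->getMissing(0, nullptr, nullptr);
std::vector<const char*> layers(n);
std::vector<nvinfer1::WeightsRole> roles(n);
refitter->getMissing(n, layers.data(), roles.data());
// ... call setWeights for each reported (layer, role) pair ...

// apply all updates to the engine, then destroy the refitter
refitter->refitCudaEngine();
refitter->destroy();
```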
- Algorithm Selection
- Default: TensorRT chooses algorithms that globally minimize the execution time of the engine.
- Customization:
- implement the IAlgorithmSelector interface.
- IAlgorithmSelector::selectAlgorithms: return the subset of candidate algorithms that are allowed for a layer (sketch below).
- IAlgorithmSelector::reportAlgorithms: record the builder's final choices.
- Notes on determinism vs. reproducibility: the selector can make builds deterministic; determinism is not the same as reproducibility.
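A sketch of a trivial selector that allows every candidate and logs the builder's final choices, assuming the selectAlgorithms/reportAlgorithms method names of recent TensorRT headers (exact signatures vary slightly by version):

```cpp
#include <NvInfer.h>
#include <iostream>

class LoggingSelector : public nvinfer1::IAlgorithmSelector
{
public:
    // fill `selection` with the candidate indices the builder may time and choose from
    int32_t selectAlgorithms(nvinfer1::IAlgorithmContext const& context,
        nvinfer1::IAlgorithm const* const* choices, int32_t nbChoices, int32_t* selection) noexcept override
    {
        for (int32_t i = 0; i < nbChoices; ++i)
            selection[i] = i;          // allow every candidate for this layer
        return nbChoices;
    }

    // called once the builder has made its final choices; useful for recording or replaying them
    void reportAlgorithms(nvinfer1::IAlgorithmContext const* const* contexts,
        nvinfer1::IAlgorithm const* const* choices, int32_t nbAlgorithms) noexcept override
    {
        for (int32_t i = 0; i < nbAlgorithms; ++i)
            std::cout << contexts[i]->getName() << " -> implementation "
                      << choices[i]->getAlgorithmVariant().getImplementation() << "\n";
    }
};

// attach to the builder configuration before building:
// LoggingSelector selector;
// config->setAlgorithmSelector(&selector);
```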
- Using the Python API
- Extending TensorRT With Custom Layers