Artificial intelligence (AI) adoption is rapidly increasing across industries and use cases. Recent advancements in deep learning (DL), large language models (LLMs), and generative AI have allowed customers to leverage state-of-the-art solutions that exhibit human-like performance. These complex deep neural network models often require hardware acceleration for faster training and real-time inference. GPUs, with their massively parallel processing cores, are well suited for these DL tasks.
However, DL applications often involve preprocessing or postprocessing steps in an inference pipeline in addition to model invocation. For instance, in an object detection use case, input images may need to be resized or cropped before being passed to a computer vision model, and text inputs may need to be tokenized before being fed to a large language model. NVIDIA Triton Inference Server is an open-source inference server that lets users define such inference pipelines as an ensemble of models in the form of a directed acyclic graph (DAG). It is designed to run models at scale on both CPU and GPU.
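As a minimal sketch of how such an ensemble is declared, the snippet below writes a hypothetical config.pbtxt whose ensemble_scheduling block chains a preprocessing model into an image classifier. The model names (dali_preprocess, inception_v3), tensor names, and shapes are illustrative placeholders, not the exact configuration used in the post.

```python
import textwrap
from pathlib import Path

# Hypothetical repository layout: model_repository/ensemble/1/ plus the config below.
ensemble_dir = Path("model_repository/ensemble")
(ensemble_dir / "1").mkdir(parents=True, exist_ok=True)

config = textwrap.dedent("""\
    name: "ensemble"
    platform: "ensemble"
    max_batch_size: 8
    input [
      { name: "INPUT", data_type: TYPE_UINT8, dims: [ -1 ] }
    ]
    output [
      { name: "OUTPUT", data_type: TYPE_FP32, dims: [ 1001 ] }
    ]
    ensemble_scheduling {
      step [
        {
          model_name: "dali_preprocess"
          model_version: -1
          input_map  { key: "DALI_INPUT",  value: "INPUT" }
          output_map { key: "DALI_OUTPUT", value: "preprocessed_image" }
        },
        {
          model_name: "inception_v3"
          model_version: -1
          input_map  { key: "input",         value: "preprocessed_image" }
          output_map { key: "probabilities", value: "OUTPUT" }
        }
      ]
    }
    """)

(ensemble_dir / "config.pbtxt").write_text(config)
```

Each model referenced in a step has its own directory and configuration in the same repository, and Triton resolves the data flow between steps from the input_map and output_map entries.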
Amazon SageMaker supports seamless deployment of Triton, enabling users to take advantage of Triton’s features while benefiting from SageMaker capabilities such as a managed and secure environment, MLOps tools integration, automatic scaling of hosted models, and more. AWS continuously innovates on pricing options, proactive cost-optimization services, and cost-saving features like multi-model endpoints (MMEs). MMEs provide a cost-effective solution for deploying multiple models that share the same set of resources and a shared serving container. Instead of deploying multiple single-model endpoints, users can reduce hosting costs by deploying multiple models behind one endpoint and paying only for a single inference environment. MMEs also reduce deployment overhead because SageMaker manages loading models into memory and scaling them based on traffic patterns.
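To illustrate, an MME can be created with the standard SageMaker APIs by pointing a single container at an Amazon S3 prefix that holds one packaged model archive per ensemble and setting Mode to MultiModel. The bucket, image URI, role ARN, resource names, and instance type below are placeholders for this sketch, not values from the post.

```python
import boto3

sm = boto3.client("sagemaker")

# S3 prefix that holds one packaged model archive (model.tar.gz) per ensemble.
model_data_prefix = "s3://my-bucket/triton-mme/"

container = {
    "Image": "<account>.dkr.ecr.<region>.amazonaws.com/sagemaker-tritonserver:<tag>",
    "ModelDataUrl": model_data_prefix,
    "Mode": "MultiModel",  # treat the prefix as a multi-model source
}

sm.create_model(
    ModelName="triton-mme-gpu",
    ExecutionRoleArn="arn:aws:iam::<account>:role/<sagemaker-execution-role>",
    PrimaryContainer=container,
)

sm.create_endpoint_config(
    EndpointConfigName="triton-mme-gpu-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "triton-mme-gpu",
            "InstanceType": "ml.g4dn.xlarge",  # GPU instance; choose per workload
            "InitialInstanceCount": 1,
        }
    ],
)

sm.create_endpoint(
    EndpointName="triton-mme-gpu-ep",
    EndpointConfigName="triton-mme-gpu-config",
)
```

The same endpoint then serves every model.tar.gz uploaded under that prefix, with the target model selected per request at invocation time.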
This post demonstrates how to run multiple deep learning ensemble models on a GPU instance using a SageMaker MME. It provides a step-by-step solution walkthrough and code examples to deploy two types of ensembles: one that uses NVIDIA DALI for image preprocessing and a TensorFlow Inception v3 model for inference, and another that uses a Python model for text preprocessing, a BERT model for token embeddings, and a postprocessing model to produce sentence embeddings.
SageMaker MMEs with GPU work by hosting multiple models in a single container. SageMaker controls the lifecycle of the models by loading them into and unloading them from the container’s memory as they are invoked. When an invocation request arrives, SageMaker routes it to an endpoint instance. If the target model has not been loaded yet, it is downloaded from Amazon S3 to the instance’s Amazon Elastic Block Store (Amazon EBS) volume and then loaded into the container’s memory on the GPU-accelerated compute instance. If the model is already loaded, the invocation is faster because no further steps are needed. If additional models need to be loaded and the instance’s memory utilization is high, unused models are unloaded from the container to free up memory. Unloaded models remain on the instance’s EBS volume, which eliminates the need to download them again; if the storage volume reaches its capacity, unused models are deleted from it. For high traffic, additional instances can be added manually or through an auto scaling policy to accommodate the load, as sketched below.
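As one hedged example of such a policy, the endpoint variant can be registered with Application Auto Scaling to track the average number of invocations per instance. The endpoint and variant names, capacity bounds, and target value are illustrative assumptions.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Endpoint and variant names are placeholders; the resource ID format is fixed by SageMaker.
resource_id = "endpoint/triton-mme-gpu-ep/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="mme-invocations-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        # Scale out when average invocations per instance exceed the target value.
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 120,
        "ScaleInCooldown": 300,
    },
)
```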
To deploy Triton ensembles, users can take advantage of the Triton server architecture, which includes a model repository that stores models either locally or remotely in Amazon S3. Each model in the repository must include a model configuration specifying information about the model, such as the platform or backend, max_batch_size, and input/output tensors. SageMaker enables model deployment with the Triton server through managed Triton Inference Server Containers, which support common ML frameworks and provide environment variables for performance tuning. It is recommended to use SageMaker Deep Learning Containers (DLC) images because they receive regular maintenance and security updates.
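For reference, a minimal per-model configuration for a TensorFlow SavedModel served by Triton might look like the following sketch; the model name, tensor names, shapes, and batch size are hypothetical and would need to match the actual model.

```python
import textwrap
from pathlib import Path

# Hypothetical layout: model_repository/inception_v3/1/model.savedmodel/...
model_dir = Path("model_repository/inception_v3")
(model_dir / "1").mkdir(parents=True, exist_ok=True)

config = textwrap.dedent("""\
    name: "inception_v3"
    platform: "tensorflow_savedmodel"   # Triton backend that serves the model
    max_batch_size: 16                  # maximum batch size Triton may form
    input [
      { name: "input", data_type: TYPE_FP32, dims: [ 299, 299, 3 ] }
    ]
    output [
      { name: "probabilities", data_type: TYPE_FP32, dims: [ 1001 ] }
    ]
    instance_group [ { kind: KIND_GPU, count: 1 } ]
    """)

(model_dir / "config.pbtxt").write_text(config)
```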
The post also provides instructions for setting up the environment, preparing the ensembles, and packaging the artifacts before deployment. Finally, it demonstrates how to invoke the ensembles using the SageMaker endpoint, specifying the target ensemble and providing the required input.
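As a hedged sketch of that invocation, the request below sends a Triton-style JSON payload to the endpoint and uses TargetModel to select which packaged ensemble handles it. The endpoint name, archive name, tensor name, shape, datatype, and content type are placeholders consistent with the earlier sketches rather than the post’s exact values.

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Triton accepts inference requests in the KServe v2 JSON format; the tensor name,
# shape, and datatype must match the ensemble's config.pbtxt (placeholders here).
payload = {
    "inputs": [
        {
            "name": "INPUT",
            "shape": [1, 16],
            "datatype": "UINT8",
            "data": [[0] * 16],  # e.g. raw encoded image bytes as a uint8 array
        }
    ]
}

response = runtime.invoke_endpoint(
    EndpointName="triton-mme-gpu-ep",
    ContentType="application/octet-stream",
    Body=json.dumps(payload),
    # TargetModel selects which packaged ensemble under the S3 prefix serves this
    # request; it is downloaded and loaded into memory on first use.
    TargetModel="image_ensemble.tar.gz",
)

result = json.loads(response["Body"].read().decode("utf-8"))
print(result["outputs"][0]["name"], result["outputs"][0]["shape"])
```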
To follow along with the example and access the code, see the public SageMaker examples repository.