Optimizing Deployment Cost of Amazon SageMaker JumpStart Foundation Models using Asynchronous Endpoints

by instadatahelp | Sep 5, 2023 | AI Blogs

The success of generative AI applications across industries has caught the attention of companies worldwide. They want to replicate and surpass the achievements of their competitors, or to unlock new and exciting use cases. To power their generative AI innovation, they are exploring foundation models such as TII Falcon, Stable Diffusion XL, and OpenAI's GPT-3.5. Foundation models are generative AI models that can understand and generate human-like content because they are trained on vast amounts of unstructured data. They have transformed computer vision and natural language processing tasks such as image generation, translation, and question answering, and they serve as building blocks for many AI applications and advanced intelligent systems.

Deploying foundation models, however, can be challenging, particularly in terms of cost and resource requirements. These models are large, ranging from hundreds of millions to tens of billions of parameters, so loading and serving them demands extensive computational resources, including powerful accelerators and significant memory capacity. Deployment typically requires at least one GPU, and often more, to handle the computational load efficiently. For instance, the TII Falcon-40B Instruct model needs at least an ml.g5.12xlarge instance just to be loaded into memory, and it performs better on larger instances. Keeping GPU-powered instances running for extended periods, potentially 24/7, makes it hard to demonstrate a return on investment (ROI), especially during development cycles or for sporadic workloads.

Earlier this year, we introduced Amazon Bedrock, a serverless API that provides access to foundation models from Amazon and our generative AI partners. Although it is currently in private preview, its serverless API lets users consume foundation models from Amazon, Anthropic, Stability AI, and AI21 without deploying any endpoints themselves. However, the catalog of open-source models from communities such as Hugging Face keeps growing, and not all of them are available through Amazon Bedrock.

In this article, we address these situations and tackle the high cost of hosting large foundation models by deploying them from Amazon SageMaker JumpStart to Amazon SageMaker asynchronous endpoints. This approach reduces costs because the endpoint runs only while there are requests in the queue, plus a short time-to-live afterwards; when no requests are waiting to be processed, it scales down to zero instances. The trade-off is that an endpoint that has scaled down to zero incurs a cold start before it can serve inferences again, which is acceptable for many use cases.

Here is an overview of our solution architecture. We use a notebook as the user interface; it could be replaced by a web UI built with Streamlit or a similar technology. In our case, the notebook is an Amazon SageMaker Studio notebook running on an ml.m5.large instance with the PyTorch 2.0 Python 3.10 CPU kernel. The notebook queries the endpoint in three ways: with the SageMaker Python SDK, with the AWS SDK for Python (Boto3), and with LangChain. The endpoint itself runs asynchronously on SageMaker and hosts the Falcon-40B Instruct model; a minimal sketch of the SageMaker Python SDK query path follows below.
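As a minimal sketch, here is how the notebook could query such an endpoint through the SageMaker Python SDK, assuming the asynchronous endpoint described in the following sections has already been deployed. The endpoint name and the payload fields are assumptions based on the typical JumpStart text-generation interface and should be adapted to your own deployment.

```python
from sagemaker.predictor import Predictor
from sagemaker.predictor_async import AsyncPredictor
from sagemaker.async_inference.waiter_config import WaiterConfig
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

endpoint_name = "falcon-40b-instruct-async"  # assumption: name of the deployed endpoint

# Wrap a regular Predictor so that requests go through the asynchronous API.
predictor = AsyncPredictor(
    Predictor(
        endpoint_name=endpoint_name,
        serializer=JSONSerializer(),
        deserializer=JSONDeserializer(),
    )
)

# Assumed payload shape for the Falcon text-generation container.
payload = {
    "inputs": "Explain asynchronous inference in one paragraph.",
    "parameters": {"max_new_tokens": 200, "temperature": 0.1},
}

# predict() uploads the payload to S3, calls InvokeEndpointAsync behind the
# scenes, and polls the S3 output location until the result is available.
# A generous waiter is used here because a cold start can take several minutes.
result = predictor.predict(payload, waiter_config=WaiterConfig(max_attempts=60, delay=30))
print(result)
```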
The Falcon-40B Instruct model is, at the time of writing, among the best-performing openly available instruct models, and it is available in SageMaker JumpStart, so we can deploy it to the endpoint with a single API call.

SageMaker asynchronous inference is one of the four deployment options in SageMaker, alongside real-time endpoints, batch transform, and serverless inference. It is suited to requests with large payload sizes (up to 1 GB), long processing times, and near-real-time latency requirements. Its main advantage when dealing with large foundation models, especially during a proof of concept (POC) or development, is that asynchronous inference can be configured to scale down to zero instances when there are no requests to process, which helps save costs.

To deploy an asynchronous inference endpoint, you create an AsyncInferenceConfig object. If you create the configuration without specifying its arguments, the default S3OutputPath is s3://sagemaker-{REGION}-{ACCOUNTID}/async-endpoint-outputs/{UNIQUE-JOB-NAME} and the default S3FailurePath is s3://sagemaker-{REGION}-{ACCOUNTID}/async-endpoint-failures/{UNIQUE-JOB-NAME}.

SageMaker JumpStart is a feature of SageMaker that accelerates the machine learning (ML) journey by providing pre-trained models, solution templates, and example notebooks. It offers access to a wide range of pre-trained models for different problem types, so you can start ML tasks from a solid foundation, along with solution templates for common use cases and example notebooks for learning. One-click solution launches and comprehensive resources reduce the time and effort needed to gain practical ML experience.

To deploy the model, our first step is to use either the SageMaker JumpStart UI or the SageMaker Python SDK, which provides an API for deploying the model to the asynchronous endpoint (see the first sketch after this section). This process takes approximately 10 minutes: the endpoint is created, the container and the model artifacts are downloaded to the endpoint, the model configuration is loaded from SageMaker JumpStart, and finally the asynchronous endpoint is exposed through a DNS endpoint.

To make sure our endpoint can scale down to zero instances, we configure auto scaling with Application Auto Scaling: we register the endpoint variant as a scalable target, define a scaling policy, and apply it (see the second sketch after this section). The policy uses the custom metric ApproximateBacklogSizePerInstance, expressed as a CustomizedMetricSpecification, to determine the scaling behavior. After the scaling policy is in place, we can verify it in the SageMaker console by navigating to Endpoints under Inference and selecting the deployed endpoint.

To invoke the asynchronous endpoint, we store the request payload in Amazon Simple Storage Service (Amazon S3) and pass a pointer to this payload in the InvokeEndpointAsync request (see the third sketch after this section). SageMaker queues the request for processing and returns an identifier and an output location in its response. Once the request is processed, the result is placed in the specified Amazon S3 location, and success or error notifications can optionally be delivered for each request, for example through Amazon Simple Notification Service (Amazon SNS).
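To make the deployment step concrete, here is a minimal sketch using the SageMaker Python SDK, assuming its JumpStart support (JumpStartModel) is available in your SDK version. The model ID, endpoint name, and configuration values are assumptions that may vary by SDK version and Region; leaving output_path and failure_path unset in AsyncInferenceConfig falls back to the default S3 locations described above.

```python
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.async_inference import AsyncInferenceConfig

# Assumed JumpStart model ID for Falcon-40B Instruct; check the ID for your Region and SDK version.
model = JumpStartModel(model_id="huggingface-llm-falcon-40b-instruct-bf16")

# With no output_path/failure_path set, the default async S3 locations are used.
async_config = AsyncInferenceConfig(
    max_concurrent_invocations_per_instance=4,
)

predictor = model.deploy(
    initial_instance_count=1,                    # auto scaling can later take this to zero
    instance_type="ml.g5.12xlarge",              # smallest instance that fits Falcon-40B Instruct
    endpoint_name="falcon-40b-instruct-async",   # hypothetical endpoint name
    async_inference_config=async_config,
)
```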
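The auto scaling setup can be sketched with Boto3 and Application Auto Scaling as follows. The policy name, target value, and cooldown periods are illustrative assumptions; AllTraffic is the default variant name for a single-variant endpoint.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

endpoint_name = "falcon-40b-instruct-async"  # assumption: the endpoint deployed above
resource_id = f"endpoint/{endpoint_name}/variant/AllTraffic"

# Register the endpoint variant as a scalable target that is allowed to reach zero instances.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=1,
)

# Target-tracking policy on the backlog per instance, so the endpoint scales out
# when requests queue up and scales back in when the queue stays empty.
autoscaling.put_scaling_policy(
    PolicyName="falcon-async-backlog-scaling",  # hypothetical policy name
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 600,   # wait 10 minutes of empty backlog before scaling in
        "ScaleOutCooldown": 300,
    },
)
```

With MinCapacity set to 0, the endpoint scales in to zero instances after the backlog has stayed empty for the scale-in cooldown period.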
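Finally, here is a sketch of the invocation flow with Boto3. The bucket name, key prefix, and prompt are placeholders; the payload is staged in Amazon S3 first, and InvokeEndpointAsync is then pointed at it.

```python
import json
import uuid
import boto3

s3 = boto3.client("s3")
runtime = boto3.client("sagemaker-runtime")

bucket = "my-sagemaker-bucket"                 # assumption: an existing bucket you own
key = f"async-inputs/{uuid.uuid4()}.json"      # hypothetical key for this request

payload = {
    "inputs": "Write a short poem about serverless GPUs.",
    "parameters": {"max_new_tokens": 100},
}

# 1. Stage the request payload in Amazon S3.
s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(payload))

# 2. Point InvokeEndpointAsync at the payload; SageMaker queues the request.
response = runtime.invoke_endpoint_async(
    EndpointName="falcon-40b-instruct-async",  # assumption: the endpoint deployed above
    InputLocation=f"s3://{bucket}/{key}",
    ContentType="application/json",
)

# 3. The response carries an inference ID and the S3 location where the result
#    will appear once the request has been processed.
print(response["InferenceId"], response["OutputLocation"])
```

The returned OutputLocation is where the result will appear once the request has been processed; you can poll that key or rely on the optional notifications mentioned above.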