We are thrilled to announce that response streaming is now available through Amazon SageMaker real-time inference. This new feature lets you continuously stream inference responses back to the client, enabling you to build interactive experiences for generative AI applications such as chatbots, virtual assistants, and music generators. With response streaming, you start receiving partial responses as soon as they are generated instead of waiting for the complete response, which significantly reduces the time to first byte for your generative AI applications.

In this post, we will demonstrate how to build a streaming web application using SageMaker real-time endpoints with the new response streaming feature for an interactive chat use case. We will be using Streamlit for the sample demo application UI.

To enable response streaming from SageMaker, you can use the new InvokeEndpointWithResponseStream API. This API improves customer satisfaction by delivering a faster time to first response byte. The reduction in perceived latency is particularly important for generative AI applications, where seeing output immediately matters more than waiting for the complete payload. The feature also helps you build sticky sessions that maintain continuity across interactions, resulting in more natural and efficient user experiences, especially for chatbots.
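As a quick illustration, here is a minimal sketch of calling the API with boto3 against an already-deployed streaming endpoint. The endpoint name and request payload shape are placeholders, not values from this post; the call returns an EventStream whose PayloadPart events carry the raw response bytes.

```python
import json
import boto3

# SageMaker runtime client; region comes from your default AWS configuration
smr = boto3.client("sagemaker-runtime")

response = smr.invoke_endpoint_with_response_stream(
    EndpointName="falcon-7b-streaming-endpoint",  # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps({
        "inputs": "What is Amazon SageMaker?",    # payload schema depends on your container
        "parameters": {"max_new_tokens": 256},
    }),
)

# The API returns an EventStream; each PayloadPart event carries a chunk of raw bytes
for event in response["Body"]:
    chunk = event.get("PayloadPart", {}).get("Bytes", b"")
    print(chunk.decode("utf-8"), end="", flush=True)
```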

The implementation of response streaming in SageMaker real-time endpoints is achieved through HTTP/1.1 chunked encoding, a mechanism for sending a single response in multiple chunks. This HTTP standard is supported by most client/server frameworks and enables streaming of both text and image data. Models hosted on SageMaker endpoints, such as Falcon, Llama 2, and Stable Diffusion, can now send back streamed responses as text or images. Both the input and output are secured in transit using TLS with AWS SigV4 authentication.

To take advantage of the new streaming API, your model container needs to return the streamed response as chunked encoded data. The high-level architecture for response streaming with a SageMaker inference endpoint is illustrated in the diagram.

One of the key use cases that will benefit from response streaming is generative AI model-powered chatbots. Traditionally, users have to send a query and wait for the entire response to be generated before receiving an answer, which can take a significant amount of time. With response streaming, the chatbot can start sending back partial inference results as they are generated. This means that users can see the initial response almost instantly, even as the AI continues refining its answer in the background. This creates a seamless and engaging conversation flow, where users feel like they are chatting with an AI that understands and responds in real time.

In this post, we will showcase two container options to create a SageMaker endpoint with response streaming: using an AWS Large Model Inference (LMI) container and a Hugging Face Text Generation Inference (TGI) container. We will walk you through the detailed implementation steps to deploy and test the Falcon-7B-Instruct model using both LMI and TGI containers on SageMaker. However, it’s important to note that any model can take advantage of this new streaming feature.

Before proceeding with the implementation, there are a few prerequisites. You will need an AWS account with an IAM role that has the necessary permissions to manage resources created as part of the solution. If you are new to Amazon SageMaker Studio, you will need to create a SageMaker domain. Additionally, you may need to request a service quota increase for the corresponding SageMaker hosting instances that you plan to use. You can find more information on these prerequisites in the provided links.

Option 1: Deploy a real-time streaming endpoint using an LMI container
The LMI container is designed for hosting large language models (LLMs) on AWS infrastructure to enable low-latency inference use cases. It utilizes Deep Java Library (DJL) Serving, an open-source, high-level, engine-agnostic Java framework for deep learning. The LMI container supports various open-source libraries like DeepSpeed, Accelerate, Transformers-neuronx, and FasterTransformer to partition model parameters using model parallelism techniques. This allows you to utilize the memory of multiple GPUs or accelerators for inference. You can find more details on the benefits of using the LMI container in the provided links.

For the LMI container, you will need the following artifacts to set up the model for inference: serving.properties (required), model.py (optional), and requirements.txt (optional). These artifacts define the model server settings, the core inference logic (if required), and any additional pip packages needed for the model.
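As a rough sketch, the snippet below stages a minimal serving.properties for Falcon-7B-Instruct from a notebook. The property names and values shown are illustrative assumptions and can differ across LMI container versions, so verify them against the DJL Serving documentation before using them.

```python
# Hypothetical sketch: write a minimal serving.properties for Falcon-7B-Instruct.
from pathlib import Path

model_dir = Path("falcon-7b-streaming")            # local staging directory (illustrative)
model_dir.mkdir(exist_ok=True)

serving_properties = "\n".join([
    "engine=MPI",                                  # DJL Serving engine used by the LMI container
    "option.model_id=tiiuae/falcon-7b-instruct",   # Hugging Face model ID downloaded at startup
    "option.trust_remote_code=true",
    "option.tensor_parallel_degree=1",             # number of GPUs to shard the model across
    "option.rolling_batch=auto",                   # enable continuous (rolling) batching
    "option.output_formatter=jsonlines",           # stream results as application/jsonlines
])
(model_dir / "serving.properties").write_text(serving_properties + "\n")
```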

To create the SageMaker model, you first retrieve the container image URI, which you can do with the SageMaker Python SDK. With the image URI in hand, you can then use the SDK to create the SageMaker model and deploy it to a real-time endpoint, specifying the instance type, endpoint name, and other relevant configurations.
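The following sketch shows one way to do this with the SageMaker Python SDK, assuming the model artifacts from the previous step have been packaged and uploaded to Amazon S3. The container version, S3 path, and instance type are placeholders to adapt to your account and model size.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.model import Model
from sagemaker.utils import name_from_base

role = sagemaker.get_execution_role()      # IAM role with SageMaker permissions
session = sagemaker.Session()
region = session.boto_region_name

# Retrieve the LMI (DJL DeepSpeed) container image URI; the version is illustrative
image_uri = image_uris.retrieve(framework="djl-deepspeed", region=region, version="0.23.0")

# model_data points to the tarball containing serving.properties (path is a placeholder)
model = Model(
    image_uri=image_uri,
    model_data="s3://<your-bucket>/falcon-7b-streaming/model.tar.gz",
    role=role,
    sagemaker_session=session,
)

endpoint_name = name_from_base("falcon-7b-streaming")
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",         # instance type is an assumption; size to your model
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=900,
)
```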

When the endpoint is in service, you can use the InvokeEndpointWithResponseStream API to invoke the model and receive the response as a stream of parts. The response content type for the LMI container is application/jsonlines, which can be deserialized using the default deserializer provided by the SageMaker Python SDK. A LineIterator helper class can be used to parse the response stream, as sketched below.
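Because PayloadPart boundaries do not necessarily align with line boundaries, a small helper can buffer the incoming bytes and yield complete JSON lines. The sketch below shows a simplified LineIterator for that purpose; the endpoint name, payload schema, and the "outputs" key are assumptions to adjust to your deployment.

```python
import io
import json
import boto3

smr = boto3.client("sagemaker-runtime")


class LineIterator:
    """Buffers PayloadPart bytes and yields complete lines from an application/jsonlines stream."""

    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord("\n"):          # only yield complete lines
                self.read_pos += len(line)
                return line[:-1]
            event = next(self.byte_iterator)            # StopIteration ends the stream
            if "PayloadPart" in event:
                self.buffer.seek(0, io.SEEK_END)
                self.buffer.write(event["PayloadPart"]["Bytes"])


response = smr.invoke_endpoint_with_response_stream(
    EndpointName="falcon-7b-streaming-endpoint",        # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps({"inputs": "Tell me about Amazon SageMaker.",
                     "parameters": {"max_new_tokens": 128}}),
)

for line in LineIterator(response["Body"]):
    if not line:
        continue
    data = json.loads(line)
    # The "outputs" key is an assumption about the LMI jsonlines schema;
    # adjust to whatever your container actually emits.
    print("".join(data.get("outputs", [])), end="", flush=True)
```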

By following these steps, you can successfully deploy a real-time streaming endpoint using an LMI container on SageMaker. This approach allows you to leverage the response streaming feature and build interactive chatbots or other generative AI applications.
