Today, we are thrilled to announce that you can now fine-tune Llama 2 models by Meta using Amazon SageMaker JumpStart. The Llama 2 family consists of large language models (LLMs) that are pre-trained and fine-tuned generative text models. These models range in size from 7 billion to 70 billion parameters, with the fine-tuned LLMs, called Llama-2-chat, optimized for dialogue use cases.

With SageMaker JumpStart, you have access to a machine learning (ML) hub that offers algorithms, models, and ML solutions to quickly start your ML projects. Now, you can fine-tune the 7 billion, 13 billion, and 70 billion parameter Llama 2 text generation models on SageMaker JumpStart. This can be done easily through the Amazon SageMaker Studio UI or using the SageMaker Python SDK.

Generative AI foundation models have been a prominent focus in ML and artificial intelligence research for over a year. These models excel in generative tasks like text generation, summarization, question answering, and image and video generation due to their large size and training on extensive datasets and tasks. However, there are specific use cases, such as healthcare or financial services, that require fine-tuning of these models with domain-specific data to achieve optimal results.

In this post, we will guide you through the process of fine-tuning Llama 2 pre-trained text generation models using SageMaker JumpStart. Llama 2 is an auto-regressive language model that utilizes an optimized transformer architecture. It is designed for commercial and research use in English and comes in various parameter sizes. The fine-tuned versions of Llama 2 models have undergone supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. These models were pre-trained on 2 trillion tokens of data from publicly available sources. The pre-trained models can be adapted for different natural language generation tasks, while the tuned models are specifically intended for assistant-like chat applications.

To access Llama 2 models, you need to accept the End User License Agreement (EULA). Currently, Llama 2 is available in specific regions for both deploying pre-trained models and fine-tuning and deploying fine-tuned models.

SageMaker JumpStart allows ML practitioners to choose from a wide selection of publicly available foundation models. These models can be deployed on dedicated Amazon SageMaker instances within a network isolated environment, and you can customize them using SageMaker for model training and deployment. By using SageMaker Studio or the SageMaker Python SDK, you can easily discover and deploy Llama 2 models. This enables you to leverage SageMaker’s features like Amazon SageMaker Pipelines, Amazon SageMaker Debugger, and container logs for model performance and MLOps controls. The deployed models are hosted in a secure AWS environment under your VPC controls, ensuring data security.

To fine-tune Llama 2 models, you can choose between the SageMaker Studio UI or the SageMaker Python SDK. In the Studio UI, you can access Llama 2 models under Models, notebooks, and solutions in SageMaker JumpStart. From there, you can specify the Amazon Simple Storage Service (Amazon S3) bucket containing the training and validation datasets for fine-tuning, and configure deployment settings, hyperparameters, and security settings. When you start the training job, fine-tuning begins; after it finishes, you can deploy the fine-tuned model from the model page on SageMaker JumpStart.
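Before pointing the training job at an S3 bucket, you need the dataset in a format the trainer understands. As a minimal sketch, an instruction-tuning dataset can be laid out as a JSON lines file of examples plus a template that describes how to assemble each example into a prompt; the file names follow the convention used for JumpStart fine-tuning, but the field names and template wording below are illustrative, not the exact schema:

```python
# Hypothetical sketch of preparing an instruction-tuning dataset. The file
# names (train.jsonl, template.json) follow the JumpStart convention; the
# field names and template text are illustrative assumptions.
import json
import tempfile
from pathlib import Path


def write_dataset(out_dir: str, examples: list) -> Path:
    """Write training examples and a prompt template to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # One JSON object per line, with the fields the template refers to.
    with open(out / "train.jsonl", "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")

    # The template tells the trainer how each example becomes a prompt.
    template = {
        "prompt": "Below is an instruction that describes a task.\n\n"
                  "### Instruction:\n{instruction}\n\n### Response:\n",
        "completion": "{response}",
    }
    with open(out / "template.json", "w") as f:
        json.dump(template, f)
    return out


examples = [
    {"instruction": "Summarize the quarterly report.",
     "response": "Revenue grew 8% year over year."},
]
dataset_dir = write_dataset(tempfile.mkdtemp(), examples)
```

Once written, you would upload this directory to the S3 bucket that the Studio UI (or the SDK) points the training job at.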

Alternatively, you can use the SageMaker Python SDK to fine-tune Llama 2 models. The code provided in the post demonstrates how to fine-tune the Llama 2 7 billion parameter model. You can change the model_id to fine-tune the 13 billion or 70 billion parameter models. The code includes dataset preparation, training on your custom dataset, and deploying the fine-tuned model.
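The SDK workflow can be sketched as follows. The model IDs below are the JumpStart identifiers for the Llama 2 text generation models; the hyperparameter names are illustrative assumptions, and actually running the job requires an AWS account, a SageMaker execution role, and EULA acceptance, so the SageMaker calls are kept inside a function that is not invoked here:

```python
# Hedged sketch of fine-tuning via the SageMaker Python SDK. Swapping the
# size key changes the model_id, as described in the post.
MODEL_IDS = {
    "7b": "meta-textgeneration-llama-2-7b",
    "13b": "meta-textgeneration-llama-2-13b",
    "70b": "meta-textgeneration-llama-2-70b",
}


def fine_tune_and_deploy(size: str, train_data_s3_uri: str):
    """Launch a JumpStart fine-tuning job and deploy the result.

    Not called in this sketch: it needs AWS credentials and incurs cost.
    """
    # Imported lazily so the rest of the module works without sagemaker.
    from sagemaker.jumpstart.estimator import JumpStartEstimator

    estimator = JumpStartEstimator(
        model_id=MODEL_IDS[size],
        environment={"accept_eula": "true"},  # you must accept the Llama 2 EULA
    )
    # Hyperparameter names are assumptions; check the model's documented set.
    estimator.set_hyperparameters(instruction_tuned="True", epoch="5")
    estimator.fit({"training": train_data_s3_uri})
    return estimator.deploy()
```

Changing `size` from `"7b"` to `"13b"` or `"70b"` is all that is needed to target the larger models, since everything else in the workflow stays the same.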

To optimize the fine-tuning process for large models like Llama, two techniques are employed. The first is Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning (PEFT) method. LoRA freezes the entire model and adds a small set of adjustable parameters or layers; by fine-tuning less than 1% of the parameters, it significantly reduces memory requirements, training time, and cost. The second technique is Int8 quantization, which further decreases the memory footprint during training by storing weights as 8-bit integers instead of higher-precision floating point values. Although this lowers numerical precision, it shrinks the model's memory footprint without a significant loss in quality.
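The two ideas can be illustrated with a toy NumPy example (this is not the actual training code, just a sketch of the arithmetic): LoRA keeps a weight matrix W frozen and learns a low-rank update B @ A, so only a tiny fraction of parameters is trainable, while Int8 quantization stores a matrix as 8-bit integers plus a scale factor:

```python
# Toy illustration of LoRA and Int8 quantization; the dimensions and rank
# are arbitrary choices for the demo.
import numpy as np

d, r = 4096, 8                       # hidden size, LoRA rank (assumed values)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))      # frozen pre-trained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable low-rank factor
B = np.zeros((d, r))                 # zero-initialized, so W_eff == W at start

W_eff = W + B @ A                    # effective weight in the forward pass

# Only A and B are trained; for d=4096, r=8 this is well under 1%.
trainable_fraction = (A.size + B.size) / W.size

# Int8 quantization: map floats to [-127, 127] with a single scale factor.
scale = np.abs(W).max() / 127.0
W_int8 = np.round(W / scale).astype(np.int8)   # 1 byte/weight vs 8 (float64)
W_dequant = W_int8.astype(np.float64) * scale  # approximate reconstruction

max_err = np.abs(W - W_dequant).max()          # bounded by scale / 2
```

The trainable fraction here works out to about 0.4%, which is why LoRA's memory and cost savings are so large, and the quantization round-trip error stays within half a quantization step.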

For more details on fine-tuning Llama 2 models and to see performance benchmarking results, refer to the complete post. With the capability to fine-tune Llama 2 models using SageMaker JumpStart, you can customize and optimize these generative AI models for your specific use cases and domain data.