We are excited to announce the preview of Amazon SageMaker Profiler, a feature of Amazon SageMaker that provides a detailed view of the AWS compute resources used while training deep learning models on SageMaker. With SageMaker Profiler, you can monitor and analyze CPU and GPU activities, including utilization, kernel runs, synchronization operations, memory operations, latencies, and data transfer. In this post, we explore the capabilities of SageMaker Profiler.
SageMaker Profiler provides Python modules that can be used to annotate PyTorch or TensorFlow training scripts and enable SageMaker Profiler. It also offers a user interface (UI) that visualizes the profile, provides statistical summaries of profiled events, and displays a timeline of training jobs to track and understand the timing of events between GPUs and CPUs.
The Need for Profiling Training Jobs:
As deep learning and machine learning become more compute- and data-intensive, training large models with trillions of parameters requires efficient resource utilization. This is particularly evident in large language models (LLMs), which have billions of parameters and need large multi-node GPU clusters to train efficiently. However, optimizing compute resource usage can be challenging due to I/O bottlenecks, kernel launch latencies, memory limits, and low resource utilization. Inefficient hardware utilization and longer training times drive up costs and extend project timelines.
Prerequisites:
To start using SageMaker Profiler, you need a SageMaker domain in your AWS account. You also need to add domain user profiles for individual users to access the SageMaker Profiler UI application. The execution role additionally requires a minimum set of permissions, which is listed in the SageMaker Profiler documentation.
Prepare and Run a Training Job with SageMaker Profiler:
To capture kernel runs on GPUs during the training job, modify your training script using the SageMaker Profiler Python modules. Import the library and call the start_profiling() and stop_profiling() methods to mark the beginning and end of profiling. You can also use custom annotations to visualize hardware activities during specific operations. There are two approaches to annotating training scripts: profiling full functions or profiling specific code lines within functions. Both approaches are covered in the SageMaker Profiler documentation.
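As a rough sketch, an annotated training loop might look like the following. The module name smppy and the use of annotate() as a context manager are assumptions based on the SageMaker Profiler documentation; verify them against the docs for your SDK version. A no-op stub is included so the sketch also runs outside a SageMaker environment.

```python
import contextlib

try:
    import smppy  # SageMaker Profiler module (assumed name; present on SageMaker)
except ImportError:
    class _ProfilerStub:
        """No-op stand-in so this sketch runs where smppy is unavailable."""

        @staticmethod
        def start_profiling():
            pass

        @staticmethod
        def stop_profiling():
            pass

        @staticmethod
        def annotate(name):
            return contextlib.nullcontext()

    smppy = _ProfilerStub()


def train(num_steps=3):
    smppy.start_profiling()  # mark the beginning of the profiled region
    for step in range(num_steps):
        # Custom annotation: hardware activity inside this block appears
        # under the "train_step" label in the profiler timeline.
        with smppy.annotate("train_step"):
            pass  # forward pass, loss computation, and backward pass go here
    smppy.stop_profiling()  # mark the end of the profiled region
    return num_steps


steps_run = train()
```

The same annotate() call can reportedly also be applied as a decorator to profile a full function rather than a block of lines.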
Configure the SageMaker Training Job Launcher:
Once you have annotated the training script and added the profiler initiation calls, save the script and prepare a SageMaker framework estimator using the SageMaker Python SDK. Set up a profiler_config object using the ProfilerConfig and Profiler modules, then create a SageMaker estimator that passes the profiler_config object along with the other required parameters.
Start the Training Job:
Launch the training job by running the fit method on the estimator.
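The configuration and launch steps above can be sketched as follows. The entry point script name, instance type, framework versions, and the one-hour cpu_profiling_duration are illustrative assumptions, and creating or running the estimator requires the sagemaker SDK and AWS credentials, so the imports are deferred into the helper function.

```python
def build_profiled_estimator(role_arn, entry_point="train_with_profiler_script.py"):
    """Sketch: build a PyTorch estimator with SageMaker Profiler enabled.

    Requires the `sagemaker` SDK and AWS credentials when called; the
    script name, versions, and instance type below are illustrative.
    """
    from sagemaker import Profiler, ProfilerConfig
    from sagemaker.pytorch import PyTorch

    # Capture CPU activity for up to one hour alongside GPU kernel capture.
    profiler_config = ProfilerConfig(
        profile_params=Profiler(cpu_profiling_duration=3600)
    )
    return PyTorch(
        role=role_arn,
        entry_point=entry_point,
        framework_version="2.0.0",
        py_version="py310",
        instance_type="ml.p4d.24xlarge",
        instance_count=1,
        profiler_config=profiler_config,
    )


# On SageMaker, you would then launch the profiled job, for example:
# estimator = build_profiled_estimator("arn:aws:iam::<account-id>:role/<role-name>")
# estimator.fit(wait=False)  # returns immediately so you can monitor from the console
```

Passing wait=False to fit() launches the job asynchronously, which is convenient when you plan to follow progress from the SageMaker console.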
Launch the SageMaker Profiler UI:
When the training job is complete, you can access the SageMaker Profiler UI to visualize and explore the profile of the training job. Instructions are provided to launch the UI from the SageMaker console.
Gain Insights from the SageMaker Profiler:
Once you open the SageMaker Profiler UI, you can view a list of all submitted training jobs and search for a specific job. Load the profile of the desired job to generate the dashboard and timeline. The dashboard provides plots for key metrics such as GPU and CPU utilization over time, GPU kernel time, kernel launch counts, step time distribution, and kernel precision distribution. These metrics help you analyze the GPU workload, identify bottlenecks, and spot underutilized resources.
In conclusion, Amazon SageMaker Profiler is a powerful tool for monitoring and optimizing the resource usage of deep learning models trained on SageMaker. It provides detailed insights into CPU and GPU activities, helping ML practitioners improve efficiency and reduce costs.