Maintaining machine learning (ML) workflows in production can be challenging due to the various tasks involved, such as creating CI/CD pipelines, model versioning, monitoring for data drift, model retraining, and manual approval processes. To address these challenges, we present an MLOps workflow for batch inference using Amazon SageMaker, Amazon EventBridge, AWS Lambda, Amazon SNS, HashiCorp Terraform, and GitLab CI/CD.
The proposed MLOps architecture utilizes GitLab CI/CD and Terraform as the macro-orchestrators for managing model build and deployment pipelines. Amazon SageMaker Pipelines and the SageMaker Python SDK are used for creating and updating pipelines for training, hyperparameter optimization, and batch inference. Additional resources like EventBridge rules, Lambda functions, and SNS topics are created using Terraform to enable monitoring and notification functionalities.
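The article does not show the pipeline definition itself, but a minimal sketch of the pattern a GitLab CI/CD job would run with the SageMaker Python SDK might look like the following. The pipeline name, the environment variable holding the role ARN, and the `build_inference_steps()` helper (and the module it comes from) are hypothetical placeholders.

```python
# Minimal sketch: a CI/CD job that creates or updates a SageMaker pipeline
# with the SageMaker Python SDK. build_inference_steps() is a hypothetical
# helper returning the pipeline's step definitions; the pipeline name and
# role ARN are placeholders.
import os

from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import PipelineSession

from my_pipeline_steps import build_inference_steps  # hypothetical module

pipeline_session = PipelineSession()
role_arn = os.environ["SAGEMAKER_EXECUTION_ROLE_ARN"]  # injected by GitLab CI/CD

pipeline = Pipeline(
    name="batch-inference-pipeline",               # illustrative name
    steps=build_inference_steps(pipeline_session),  # TransformStep, QualityCheckStep, etc.
    sagemaker_session=pipeline_session,
)

# upsert() creates the pipeline if it does not exist and updates it otherwise,
# which keeps the CI/CD job idempotent across repeated runs.
pipeline.upsert(role_arn=role_arn)
```

The `upsert()` call is what makes the deployment pipeline safe to rerun: the same job works for the first deployment and for every subsequent update.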
The architecture follows a multi-account strategy, where ML models are built, trained, and registered in a central model registry within a data science development account. Inference pipelines are then deployed to staging and production accounts using GitLab CI/CD automation. The central model registry can also be placed in a shared services account.
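When the model registry lives in a different account than the accounts that run inference, the consuming accounts need permission to read its model packages. A minimal sketch of such a cross-account resource policy is shown below, assuming placeholder account IDs, Region, model package group name, and a minimal set of read actions; the exact actions would depend on how the staging and production accounts consume the models.

```python
# Hypothetical sketch: share a central model package group with the staging
# and production accounts via a SageMaker resource policy. Account IDs,
# Region, group name, and the action list are placeholders.
import json

import boto3

sm = boto3.client("sagemaker")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCrossAccountModelAccess",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::111111111111:root",  # staging account (placeholder)
                    "arn:aws:iam::222222222222:root",  # production account (placeholder)
                ]
            },
            "Action": [
                "sagemaker:DescribeModelPackage",
                "sagemaker:ListModelPackages",
            ],
            "Resource": "arn:aws:sagemaker:us-east-1:333333333333:model-package/my-model-group/*",
        }
    ],
}

# Attach the policy to the model package group in the central registry account.
sm.put_model_package_group_policy(
    ModelPackageGroupName="my-model-group",
    ResourcePolicy=json.dumps(policy),
)
```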
Infrastructure as code (IaC), implemented with HashiCorp Terraform and GitLab CI/CD, keeps the AWS resources in this architecture under version control and makes their provisioning repeatable.
Model training and retraining run on a schedule or are triggered by Amazon S3 events (for example, the arrival of new data). The training pipeline recalibrates the model with new data without introducing structural or material changes. The newly trained model version is registered in the model registry only if it exceeds predefined performance thresholds, and a notification is sent to the responsible data scientist for manual approval before the new version can be used for inference.
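As an illustration of how this threshold gate could be expressed in SageMaker Pipelines, the sketch below assumes an earlier evaluation step (`step_evaluate`) that writes an `evaluation_report` property file, and a registration step (`step_register`) that registers the model with `PendingManualApproval` status; the JSON path and the 0.85 threshold are placeholders.

```python
# Hypothetical sketch: register the new model version only if its evaluation
# metric clears a threshold, leaving approval pending for the data scientist.
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.functions import JsonGet

# step_evaluate and evaluation_report are assumed to come from an earlier
# evaluation ProcessingStep; step_register is assumed to be a ModelStep that
# registers the model with approval_status="PendingManualApproval".
accuracy_above_threshold = ConditionGreaterThanOrEqualTo(
    left=JsonGet(
        step_name=step_evaluate.name,
        property_file=evaluation_report,
        json_path="metrics.accuracy.value",  # placeholder metrics path
    ),
    right=0.85,  # predefined performance threshold (placeholder)
)

step_check_quality = ConditionStep(
    name="CheckModelQuality",
    conditions=[accuracy_above_threshold],
    if_steps=[step_register],  # register the model version for manual approval
    else_steps=[],             # otherwise end the pipeline without registering
)
```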
Batch inference is performed using the latest approved model version from the model registry. The batch inference pipeline includes steps for data quality checks against a baseline and model quality checks based on ground truth labels. Notifications are sent to the data scientist if issues are detected.
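A sketch of how the data quality step might be expressed with the SageMaker Pipelines `QualityCheckStep` follows; the IAM role, S3 URIs, instance settings, and model package group name are placeholders.

```python
# Hypothetical sketch: data quality check before batch inference, comparing
# the incoming batch against the statistics and constraints baselines
# captured at training time. All ARNs, URIs, and names are placeholders.
from sagemaker.model_monitor.dataset_format import DatasetFormat
from sagemaker.workflow.check_job_config import CheckJobConfig
from sagemaker.workflow.quality_check_step import (
    DataQualityCheckConfig,
    QualityCheckStep,
)

role = "arn:aws:iam::111111111111:role/SageMakerExecutionRole"  # placeholder

check_job_config = CheckJobConfig(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

data_quality_config = DataQualityCheckConfig(
    baseline_dataset="s3://my-bucket/batch-input/data.csv",  # incoming batch data
    dataset_format=DatasetFormat.csv(header=False),
    output_s3_uri="s3://my-bucket/monitoring/data-quality",
)

step_data_quality = QualityCheckStep(
    name="DataQualityCheck",
    quality_check_config=data_quality_config,
    check_job_config=check_job_config,
    skip_check=False,             # fail the step if the batch violates the constraints
    register_new_baseline=False,  # reuse the baseline generated during training
    supplied_baseline_statistics="s3://my-bucket/baselines/statistics.json",
    supplied_baseline_constraints="s3://my-bucket/baselines/constraints.json",
    model_package_group_name="my-model-group",
)
```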
Model tuning and retuning are triggered when the model quality check fails; the responsible data scientist can also trigger the process manually. Sign-off from the enterprise model review board is required before the resulting model version can be approved in the model registry.
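Whether started manually or by the Lambda function reacting to a failed quality check, the trigger amounts to starting an execution of the tuning pipeline. The sketch below assumes placeholder pipeline and parameter names.

```python
# Hypothetical sketch: start the training-with-hyperparameter-optimization
# pipeline, either manually or from the Lambda function that reacts to a
# failed model quality check. Names and values are placeholders.
import boto3

sm = boto3.client("sagemaker")

response = sm.start_pipeline_execution(
    PipelineName="training-with-hpo-pipeline",
    PipelineExecutionDisplayName="retune-after-quality-failure",
    PipelineParameters=[
        {"Name": "InputDataS3Uri", "Value": "s3://my-bucket/training/latest/"},
    ],
)
print(response["PipelineExecutionArn"])
```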
Baselines for data statistics and constraints are generated by both the training pipeline and the training-with-hyperparameter-optimization pipeline. Amazon SageMaker Model Monitor performs the data quality checks, while custom Amazon SageMaker Processing steps perform the model quality checks.
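A custom model quality check of this kind could be expressed as a Processing step that compares predictions with ground truth labels and writes a metrics report. The sketch below assumes a placeholder IAM role, S3 locations, framework version, and a hypothetical evaluation script.

```python
# Hypothetical sketch: a custom Processing step that computes model quality
# metrics from batch predictions and ground truth labels. The role, S3 paths,
# framework version, and script path are placeholders.
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.steps import ProcessingStep

role = "arn:aws:iam::111111111111:role/SageMakerExecutionRole"  # placeholder

model_quality_processor = SKLearnProcessor(
    framework_version="1.2-1",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

step_model_quality = ProcessingStep(
    name="ModelQualityCheck",
    processor=model_quality_processor,
    inputs=[
        ProcessingInput(
            source="s3://my-bucket/predictions/",
            destination="/opt/ml/processing/predictions",
        ),
        ProcessingInput(
            source="s3://my-bucket/ground-truth/",
            destination="/opt/ml/processing/ground_truth",
        ),
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/report",
            output_name="model_quality_report",
        )
    ],
    code="scripts/evaluate_model_quality.py",  # hypothetical script comparing predictions with labels
)
```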
After a newly trained model version is registered, the responsible data scientist receives a notification. Approval is required for models trained with hyperparameter optimization, and the model registry status is updated accordingly. The status change triggers a Lambda function that updates the SageMaker batch inference pipeline to use the latest approved model version.
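The core of such a Lambda function is looking up the most recent approved model package in the registry. A minimal sketch, assuming a placeholder model package group name, follows.

```python
# Hypothetical sketch of the Lambda function that reacts to the approval
# status change: it looks up the latest approved model version so the batch
# inference pipeline can be updated to use it. The group name is a placeholder.
import boto3

sm = boto3.client("sagemaker")


def lambda_handler(event, context):
    response = sm.list_model_packages(
        ModelPackageGroupName="my-model-group",
        ModelApprovalStatus="Approved",
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    latest_approved = response["ModelPackageSummaryList"][0]["ModelPackageArn"]

    # In the full workflow this ARN would be handed to the batch inference
    # pipeline, for example by upserting the pipeline definition or by
    # starting an execution with the ARN as a pipeline parameter.
    return {"model_package_arn": latest_approved}
```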
The data input/output design follows SageMaker best practices; its details are beyond the scope of this overview.