Amazon SageMaker Data Wrangler reduces the time needed to collect and prepare data for machine learning from weeks to minutes. Its visual interface lets you perform feature engineering, data selection, cleansing, exploration, visualization, and processing at scale.
Data lakes are commonly used to store data, and with Data Wrangler, you can connect to data lakes managed by AWS Lake Formation. This integration enables you to implement fine-grained access control using simple grant or revoke procedures. Data Wrangler also supports fine-grained data access control with connections to Amazon Athena.
We are excited to announce that SageMaker Data Wrangler now supports using Lake Formation with Amazon EMR. This integration lets data practitioners such as data scientists use Apache Spark, Hive, and Presto running on Amazon EMR for fast data preparation. These tools have traditionally had a steep learning curve, but with Data Wrangler you can connect to Amazon EMR and run ad hoc SQL queries on Hive or Presto in a few clicks. You can query data in the internal metastore or in an external metastore, such as the AWS Glue Data Catalog, and then prepare it.
In this post, we demonstrate how to use Lake Formation as a central data governance capability and Amazon EMR as a big data query engine to enable access for SageMaker Data Wrangler. We provide an end-to-end use case using a sample dataset called the TPC data model, which includes transaction data for products, customer demographics, inventory, web sales, and promotions.
To showcase fine-grained data access permissions, we consider two users: David and Tina. David, a data scientist on the marketing team, is tasked with building a model on customer segmentation and is only allowed to access non-sensitive customer data. Tina, a data scientist on the sales team, is responsible for building a sales forecast model and needs access to sales data for a specific region. Additionally, Tina is involved in product innovation and requires access to product data as well.
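Permissions like David's and Tina's could be expressed as Lake Formation requests along the following lines. This is a minimal sketch rather than what the post's deployment actually does: the role ARN, account ID, database, table, column, and filter names are hypothetical placeholders. The source also does not say how Tina's regional restriction is implemented, so the data cells filter is shown only as one mechanism Lake Formation offers for row-level restrictions. The helpers only build request dictionaries, and the boto3 calls they would feed are noted in comments.

```python
from typing import Any, Dict, List


def column_level_grant(principal_arn: str, database: str, table: str,
                       excluded_columns: List[str]) -> Dict[str, Any]:
    """Build keyword arguments for boto3's lakeformation.grant_permissions.

    Grants SELECT on every column of the table except the excluded ones,
    which is one way to hide sensitive fields from a principal.
    """
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnWildcard": {"ExcludedColumnNames": excluded_columns},
            }
        },
        "Permissions": ["SELECT"],
    }


def row_filter(catalog_id: str, database: str, table: str,
               filter_name: str, expression: str) -> Dict[str, Any]:
    """Build keyword arguments for lakeformation.create_data_cells_filter.

    The filter keeps all columns but only the rows matching `expression`.
    """
    return {
        "TableData": {
            "TableCatalogId": catalog_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": filter_name,
            "RowFilter": {"FilterExpression": expression},
            "ColumnWildcard": {},
        }
    }


# Hypothetical usage (all identifiers below are placeholders):
#   import boto3
#   lf = boto3.client("lakeformation")
#   lf.grant_permissions(**column_level_grant(
#       "arn:aws:iam::111122223333:role/david-role", "tpc", "customer",
#       ["c_email_address", "c_birth_year"]))
#   lf.create_data_cells_filter(**row_filter(
#       "111122223333", "tpc", "web_sales", "tina-region", "region = 'west'"))
```

Excluding columns rather than listing the allowed ones keeps the grant valid when new non-sensitive columns are added, at the cost of exposing any new sensitive column until the exclusion list is updated.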
The solution architecture uses Lake Formation to manage the data lake and its access permissions, Amazon EMR to query the data and prepare it with Spark, and AWS Identity and Access Management (IAM) roles as the identities to which Lake Formation grants data access. SageMaker Data Wrangler serves as the single visual interface for interactively querying and preparing the data.
To set up the solution, you can use the provided AWS CloudFormation stack, which deploys all the necessary components. Before getting started, ensure that you have an AWS account, an IAM user with administrator access, and an S3 bucket. The CloudFormation template provisions resources such as the data lake S3 bucket, EMR cluster, IAM roles, SageMaker Studio domain, and user profiles.
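If you prefer to launch the stack programmatically rather than from the console, a sketch with boto3 might look like the following. The template URL, stack name, and parameter keys are not given in this post, so every value passed in is a hypothetical placeholder. `CAPABILITY_NAMED_IAM` is included on the assumption that the template creates named IAM roles.

```python
from typing import Dict, List


def to_cfn_parameters(values: Dict[str, str]) -> List[Dict[str, str]]:
    """Convert a plain dict into the Parameters list CloudFormation expects."""
    return [
        {"ParameterKey": key, "ParameterValue": value}
        for key, value in sorted(values.items())
    ]


def create_stack(stack_name: str, template_url: str,
                 values: Dict[str, str], region: str) -> str:
    """Create the stack, wait for completion, and return the stack ID."""
    # Imported lazily so to_cfn_parameters stays usable without boto3 installed.
    import boto3

    cfn = boto3.client("cloudformation", region_name=region)
    response = cfn.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
        Parameters=to_cfn_parameters(values),
        # Assumption: the template creates named IAM roles.
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
    # Block until CloudFormation reports CREATE_COMPLETE (raises on failure).
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return response["StackId"]


# Hypothetical usage (URL and parameter keys are placeholders):
#   create_stack(
#       "dw-emr-lakeformation",
#       "https://example-bucket.s3.amazonaws.com/template.yaml",
#       {"S3BucketName": "my-existing-bucket"},
#       "us-east-1",
#   )
```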
Additionally, if you want to encrypt data in transit, you can create PEM certificates using OpenSSL and upload them to an S3 bucket. The CloudFormation template allows you to specify the S3 URI for the uploaded certificate file.
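The certificate step can be sketched as below. It is an illustration under stated assumptions, not the exact procedure this post prescribes: the file names and the zip bundling follow the convention Amazon EMR documents for PEM certificates used for in-transit encryption, the common name is a hypothetical placeholder, and a self-signed certificate is assumed. Check the template's parameter description for the format it actually expects.

```python
import shutil
import subprocess
import zipfile
from pathlib import Path
from typing import List

# File names assumed from Amazon EMR's PEM certificate convention.
PEM_FILES = ("privateKey.pem", "certificateChain.pem", "trustedCertificates.pem")


def openssl_command(common_name: str, days: int = 365) -> List[str]:
    """Return the argv for a self-signed certificate with an unencrypted key."""
    return [
        "openssl", "req", "-x509", "-newkey", "rsa:2048",
        "-keyout", "privateKey.pem",
        "-out", "certificateChain.pem",
        "-days", str(days),
        "-nodes",  # do not encrypt the private key with a passphrase
        "-subj", f"/CN={common_name}",
    ]


def generate_certificates(workdir: Path, common_name: str) -> None:
    """Run OpenSSL in workdir; requires the openssl binary on PATH."""
    subprocess.run(openssl_command(common_name), cwd=workdir, check=True)
    # For a self-signed certificate, the chain doubles as the trust store.
    shutil.copyfile(workdir / "certificateChain.pem",
                    workdir / "trustedCertificates.pem")


def bundle_certificates(workdir: Path, archive_name: str = "my-certs.zip") -> Path:
    """Zip the three PEM files at the archive root and return the archive path."""
    archive = workdir / archive_name
    with zipfile.ZipFile(archive, "w") as zf:
        for name in PEM_FILES:
            zf.write(workdir / name, arcname=name)
    return archive


# Hypothetical usage (the common name is a placeholder):
#   work = Path("certs")
#   work.mkdir(exist_ok=True)
#   generate_certificates(work, "*.us-east-1.compute.internal")
#   bundle_certificates(work)
```

The resulting archive can then be uploaded to your S3 bucket, for example with the console or boto3's `upload_file`, and its S3 URI supplied to the CloudFormation template.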
Once the CloudFormation stack is created, you can test the data access permissions for the two user profiles. For example, you can launch SageMaker Studio with David’s user profile and use Data Wrangler to import and prepare data visually. This allows you to verify that David only has access to non-sensitive customer data. Similarly, you can test Tina’s user profile to ensure she has access to sales and product data.
By following these steps, you can leverage the capabilities of Amazon SageMaker Data Wrangler, AWS Lake Formation, and Amazon EMR to efficiently collect and prepare data for machine learning, while maintaining fine-grained access control and data governance.