We are thrilled to announce that Amazon SageMaker Data Wrangler now supports Amazon S3 Access Points. With its user-friendly interface, SageMaker Data Wrangler simplifies data preparation and feature engineering, including data selection, cleansing, exploration, and visualization. S3 Access Points, in turn, simplify data access by providing unique hostnames with dedicated access policies.

Starting today, SageMaker Data Wrangler makes it easier to prepare data from shared datasets stored in Amazon Simple Storage Service (Amazon S3) while allowing organizations to securely control access to that data. With S3 Access Points, data administrators can create application- and team-specific access points, facilitating data sharing without having to manage a single complex bucket policy with many permission rules.

In this article, we will guide you through the process of importing data from and exporting data to an S3 access point in SageMaker Data Wrangler. This solution overview will help you understand how to streamline data management for multiple data science teams while ensuring data security and access control.

Traditionally, setting up granular access control with bucket policies has been a challenge, because a single bucket policy has to capture the permissions of every team and application that shares the bucket. Moreover, bucket policies don't offer a built-in way to secure access at the endpoint level. S3 Access Points solve these problems by attaching a dedicated policy to each access point, giving you fine-grained access control. This makes it easier to manage permissions for different teams without affecting other parts of the bucket.

Instead of modifying a single bucket policy, you can now create multiple access points with individual policies tailored to specific use cases. This reduces the risk of misconfiguration or unintended access to sensitive data. Additionally, you can restrict an access point to a specific virtual private cloud (VPC) or add policy conditions that control which IP addresses can reach the data through that access point.
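As an illustration, an access point policy can grant a single team's IAM role access to objects under its own prefix only. The following is a minimal boto3 sketch; the account ID, Region, role, access point name, and prefix are hypothetical placeholders, not values from this walkthrough:

```python
import json

import boto3

# Hypothetical identifiers -- replace with your own values.
ACCOUNT_ID = "111122223333"
REGION = "us-east-1"
ACCESS_POINT_NAME = "marketing-team-ap"
TEAM_ROLE_ARN = f"arn:aws:iam::{ACCOUNT_ID}:role/MarketingDataScienceRole"

# Allow one team's role to read and write objects only under its own prefix,
# scoped to this access point rather than to the whole bucket.
access_point_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": TEAM_ROLE_ARN},
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": (
                f"arn:aws:s3:{REGION}:{ACCOUNT_ID}:accesspoint/"
                f"{ACCESS_POINT_NAME}/object/marketing/*"
            ),
        }
    ],
}

# Attach the policy to the access point.
s3control = boto3.client("s3control", region_name=REGION)
s3control.put_access_point_policy(
    AccountId=ACCOUNT_ID,
    Name=ACCESS_POINT_NAME,
    Policy=json.dumps(access_point_policy),
)
```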

To use S3 Access Points with SageMaker Data Wrangler, follow these steps:

1. Upload your data to an S3 bucket.
2. Create an S3 access point.
3. Configure your AWS Identity and Access Management (IAM) role with the necessary policies.
4. Create a SageMaker Data Wrangler flow.
5. Export data from SageMaker Data Wrangler to the access point.

In this article, we use the Bank Marketing dataset as an example, but you can use any other dataset you prefer. Make sure you have uploaded your data to an S3 bucket before proceeding.
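If you prefer to upload the dataset with the AWS SDK instead of the console, a minimal boto3 sketch looks like the following (the bucket name, prefix, and file name are placeholders, not values you must use):

```python
import boto3

# Hypothetical names -- substitute your own bucket and dataset file.
BUCKET_NAME = "my-datawrangler-demo-bucket"
LOCAL_FILE = "bank-additional-full.csv"

# Upload the CSV under a dedicated prefix so an access point policy
# can later be scoped to just this dataset.
s3 = boto3.client("s3")
s3.upload_file(LOCAL_FILE, BUCKET_NAME, f"bank-marketing/{LOCAL_FILE}")
```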

To create an S3 access point on the Amazon S3 console, follow these steps (a scripted alternative is sketched after the list):

1. Go to the Amazon S3 console and navigate to Access Points.
2. Choose “Create access point.”
3. Enter a name for your access point and select the bucket you created.
4. Leave the remaining settings as default and choose “Create access point.”
5. Note down the Amazon Resource Name (ARN) and access point alias for later use in SageMaker Data Wrangler.
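As an alternative to the console, the following boto3 sketch creates an access point for your bucket and prints the ARN and alias to note down. The account ID, Region, bucket name, and access point name are hypothetical placeholders:

```python
import boto3

# Hypothetical identifiers -- replace with your own values.
ACCOUNT_ID = "111122223333"
REGION = "us-east-1"
BUCKET_NAME = "my-datawrangler-demo-bucket"
ACCESS_POINT_NAME = "datawrangler-demo-ap"

s3control = boto3.client("s3control", region_name=REGION)
response = s3control.create_access_point(
    AccountId=ACCOUNT_ID,
    Name=ACCESS_POINT_NAME,
    Bucket=BUCKET_NAME,
)

# Note these down for the import and export steps in SageMaker Data Wrangler.
print("Access point ARN:  ", response["AccessPointArn"])
print("Access point alias:", response["Alias"])
```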

Next, you need to configure your IAM role to use S3 access points. If you have a SageMaker Studio domain, follow these steps:

1. Go to the SageMaker console and navigate to Domains.
2. Choose your domain.
3. On the Domain settings tab, choose “Edit.”
4. Add the following two policies to the IAM role:
– Policy 1: Grants SageMaker Data Wrangler permission to call PutObject, GetObject, and DeleteObject on objects through the access point.
– Policy 2: Grants SageMaker Data Wrangler permission to retrieve the S3 access point (GetAccessPoint).
5. Create these two policies and attach them to the role. Example policy documents are sketched after this list.
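To make the two policies concrete, here is a minimal sketch of what their policy documents could look like. The account ID, Region, and access point name are hypothetical placeholders, and your administrator may scope the statements more tightly:

```python
import json

# Hypothetical identifiers -- replace with your own values.
ACCOUNT_ID = "111122223333"
REGION = "us-east-1"
ACCESS_POINT_ARN = f"arn:aws:s3:{REGION}:{ACCOUNT_ID}:accesspoint/datawrangler-demo-ap"

# Policy 1: object-level access (get, put, delete) through the access point.
object_access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": f"{ACCESS_POINT_ARN}/object/*",
        }
    ],
}

# Policy 2: permission to retrieve the access point itself.
get_access_point_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetAccessPoint",
            "Resource": ACCESS_POINT_ARN,
        }
    ],
}

print(json.dumps(object_access_policy, indent=2))
print(json.dumps(get_access_point_policy, indent=2))
```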

After configuring your IAM role, you can create a new SageMaker Data Wrangler flow by following these steps:

1. Launch SageMaker Studio.
2. Choose “New” and select “Data Wrangler Flow.”
3. Choose “Amazon S3” as the data source.
4. Enter the S3 access point using the ARN or alias you noted down earlier (a quick way to verify the alias is sketched below).
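The access point alias behaves like a bucket name in S3 data plane calls, which is what allows Data Wrangler to read through it. An optional way to confirm the alias and object key before importing is a small boto3 check; both values below are placeholders:

```python
import boto3

# Hypothetical values -- use the alias you noted down and your dataset key.
ACCESS_POINT_ALIAS = "datawrangler-demo-ap-0123456789abcdefghijk-s3alias"
OBJECT_KEY = "bank-marketing/bank-additional-full.csv"

# The alias can be used anywhere a bucket name is accepted.
s3 = boto3.client("s3")
head = s3.head_object(Bucket=ACCESS_POINT_ALIAS, Key=OBJECT_KEY)
print("Object size (bytes):", head["ContentLength"])
```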

Finally, to export data from SageMaker Data Wrangler to an S3 access point, complete the following steps:

1. In the data flow, choose the plus sign.
2. Choose “Add destination” and select “Amazon S3.”
3. Enter the dataset name and the S3 location, referencing the access point ARN. A sketch for verifying the exported output follows this list.
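After the export job completes, you can optionally verify the output by listing the exported objects through the access point. boto3 accepts the access point ARN (or alias) in place of a bucket name; the ARN and prefix below are placeholders:

```python
import boto3

# Hypothetical values -- use your access point ARN and the export prefix
# you chose in the Data Wrangler destination settings.
ACCESS_POINT_ARN = "arn:aws:s3:us-east-1:111122223333:accesspoint/datawrangler-demo-ap"
EXPORT_PREFIX = "export/"

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket=ACCESS_POINT_ARN, Prefix=EXPORT_PREFIX)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```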

By using S3 Access Points in SageMaker Data Wrangler, you can securely and efficiently import and export data without the need for complex bucket policies or navigating multiple folder structures.

Remember to clean up after you have finished experimenting. This includes stopping any running apps, deleting your domain to avoid charges, and deleting any S3 access points and buckets.

In conclusion, the availability of S3 Access Points for SageMaker Data Wrangler simplifies data access control within SageMaker Studio. By using this feature, you can give your SageMaker Studio users secure, fine-grained access to shared data and streamline your data management processes. We encourage you to give it a try!

About the authors:
– Peter Chung is a Solutions Architect at AWS, specializing in helping customers use technology to solve business problems.
– Neelam Koshiya is an Enterprise Solution Architect at AWS, assisting enterprise customers with their cloud adoption journey for strategic business outcomes.