Amazon Redshift is a SQL-based data warehouse, whereas Amazon SageMaker is the main machine learning platform on AWS. A machine learning project that involves massive datasets usually requires both services, so it’s important to understand how to ingest data from Redshift into SageMaker.
This guide is written with the following assumptions:
You have created a Redshift cluster that is enclosed within a Virtual Private Cloud (VPC)
You have created tables within the Redshift cluster
You have created a secret that contains the Redshift credentials in Secrets Manager
Approach 1: Traditional Method
1. Create and verify IAM role
Go to IAM > Roles, then create a role that allows access to the following services:
SageMaker
Redshift
Secrets Manager
In addition, verify that you have added a trust relationship for both Redshift and SageMaker, as shown below.
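For reference, a trust relationship that lets both services assume the role usually looks like the policy below. This is a minimal sketch written as a Python dictionary so that it can be printed as JSON and compared with what the role's Trust relationships tab shows. The helper `trusted_services` is a hypothetical name used only for illustration.

```python
import json

# Trust policy that lets Redshift and SageMaker assume the role.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "redshift.amazonaws.com",
                    "sagemaker.amazonaws.com",
                ]
            },
            "Action": "sts:AssumeRole",
        }
    ],
}


def trusted_services(policy):
    """Return the set of service principals allowed to assume the role."""
    services = set()
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal", {})
        if not isinstance(principal, dict):
            continue  # e.g. a wildcard principal, which names no service
        service = principal.get("Service", [])
        # A single service principal may be given as a plain string.
        if isinstance(service, str):
            service = [service]
        services.update(service)
    return services


print(json.dumps(TRUST_POLICY, indent=2))
```

If either `redshift.amazonaws.com` or `sagemaker.amazonaws.com` is missing from your role's trust policy, add it before moving on.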
If you aren’t sure about this step, refer to my previous blog post.
2. Create a Jupyter notebook on SageMaker
Once you have sorted out the permissions, it’s time to create a Jupyter notebook on SageMaker! Ensure that the notebook is in the same VPC as the Redshift cluster.
3. Establish a connection
Run this code in Jupyter, replacing the square brackets with your own information.
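A minimal sketch of such a connection is shown here. It assumes the `psycopg2` driver is installed in the notebook and that the secret stores `host`, `port`, `username`, `password`, and optionally `dbname` keys. These key names, the `dev` fallback database, and the helper names `parse_secret` and `connect_to_redshift` are assumptions for illustration, so adjust them to match your own secret.

```python
import json


def parse_secret(secret_string):
    """Turn the secret's JSON string into psycopg2 connection arguments."""
    secret = json.loads(secret_string)
    return {
        "host": secret["host"],
        "port": int(secret.get("port", 5439)),  # 5439 is Redshift's default port
        "dbname": secret.get("dbname", "dev"),  # assumption: fall back to "dev"
        "user": secret["username"],
        "password": secret["password"],
    }


def connect_to_redshift(secret_name, region_name):
    """Fetch the credentials from Secrets Manager and open a connection."""
    # Imported here so parse_secret works without these packages installed.
    import boto3
    import psycopg2

    client = boto3.client("secretsmanager", region_name=region_name)
    response = client.get_secret_value(SecretId=secret_name)
    return psycopg2.connect(**parse_secret(response["SecretString"]))


# In the notebook, replace the square brackets with your own information:
# conn = connect_to_redshift("[secret-name]", "[aws-region]")
```

Reading the credentials from Secrets Manager keeps the password out of the notebook itself, which is why the role needs access to that service.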
4. Test with a SQL script
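A quick way to check that the connection works is to run a small query and look at the rows that come back. The sketch below assumes `conn` is an open DB-API connection to Redshift, for example one created with `psycopg2`. The helper name `run_query` and the bracketed table name are placeholders, not part of any library.

```python
def run_query(conn, sql):
    """Run a SQL statement and return (column_names, rows)."""
    cursor = conn.cursor()
    try:
        cursor.execute(sql)
        # description is None for statements that return no result set.
        if cursor.description is None:
            return [], []
        columns = [col[0] for col in cursor.description]
        rows = cursor.fetchall()
    finally:
        cursor.close()
    return columns, rows


# In the notebook, with `conn` being the open Redshift connection:
# columns, rows = run_query(conn, "SELECT * FROM [schema].[table] LIMIT 5")
# print(columns)
# for row in rows:
#     print(row)
```

If you prefer working with pandas, the same result can be wrapped with `pd.DataFrame(rows, columns=columns)`.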
Approach 2: Redshift Data API
If you would rather avoid most of the IAM and VPC configuration, there is a simpler alternative: the Redshift Data API. I haven’t written a guide for this method because AWS has already published comprehensive documentation.
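To give a rough idea of what this looks like, here is a minimal sketch that uses the `redshift-data` client in `boto3`. It assumes the notebook's role is allowed to call the Data API and to read the secret. The helper names `records_to_rows` and `run_statement` and all of their arguments are placeholders for illustration. Only the first page of results is read, and the statement is expected to be a query that returns rows.

```python
import time


def records_to_rows(records):
    """Convert Data API records into plain Python lists."""
    rows = []
    for record in records:
        row = []
        for field in record:
            # Each field is a one-key dict such as {"stringValue": "a"};
            # NULL values arrive as {"isNull": True}.
            if field.get("isNull"):
                row.append(None)
            else:
                row.append(next(iter(field.values())))
        rows.append(row)
    return rows


def run_statement(sql, cluster_id, database, secret_arn, region_name,
                  poll_seconds=1.0):
    """Submit a query through the Data API and return (columns, rows)."""
    import boto3  # imported here so records_to_rows works without boto3

    client = boto3.client("redshift-data", region_name=region_name)
    statement_id = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        SecretArn=secret_arn,
        Sql=sql,
    )["Id"]

    # The call is asynchronous, so poll until the statement finishes.
    while True:
        status = client.describe_statement(Id=statement_id)
        if status["Status"] in ("FINISHED", "FAILED", "ABORTED"):
            break
        time.sleep(poll_seconds)
    if status["Status"] != "FINISHED":
        raise RuntimeError(status.get("Error", status["Status"]))

    # Only the first page of results is read in this sketch.
    result = client.get_statement_result(Id=statement_id)
    columns = [col["name"] for col in result["ColumnMetadata"]]
    return columns, records_to_rows(result["Records"])


# columns, rows = run_statement(
#     "SELECT * FROM [schema].[table] LIMIT 5",
#     "[cluster-identifier]", "[database]", "[secret-arn]", "[aws-region]",
# )
```

Because the Data API is called over HTTPS, the notebook does not need a database driver or a direct network path to the cluster.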