ssh into an AWS EMR cluster from scratch


Recently, I’ve been using AWS EMR to pre-process gigabytes of data and train ML models in PySpark. Sometimes I need terminal access to the EMR master node, for example to install a Python package or fix a broken PySpark path. I tried connecting to an existing EMR cluster with the following method (refer to screenshot).

And I was greeted by this error:

Step: ssh -i "xxx.pem" hadoop@ec2-52-62-xxx-xxx.ap-southeast-2.compute.amazonaws.com

Error: ssh: connect to host ec2-52-62-xxx-xxx.ap-southeast-2.compute.amazonaws.com port 22: Operation timed out

After some sleuthing, I discovered that the error occurs because my EMR cluster is not attached to any VPC (duh). Sadly, there aren’t many resources out there that give a clear guide to fixing this. The most relevant documentation I could find is from AWS, but it has no images, hence I decided to write this guide.

Another high-level diagram.

If you’re just like me and have minimal knowledge about networking (literally, all I know is that each device has an IP address), the diagram above gives you enough high-level context to understand the rest of this guide. Main takeaways: (1) A VPC consists of one or more public or private subnets. (2) You need an internet gateway and a route table to connect the VPC to the outside world. Let’s get into it!

1. Set up VPC

VPC set-up page

I can already tell what your first question is: why 192.168.0.1, specifically? Well, based on my research, 192.168.0.1 is the most common address for accessing and configuring wireless routers from a web browser. More importantly, it sits inside 192.168.0.0/16, one of the private IPv4 ranges reserved by RFC 1918, so it’s a safe bet. The “/16” is CIDR notation, a compact way to express the subnet mask (here 255.255.0.0): the first 16 bits identify the network and the remaining 16 bits are available for hosts. If you are interested in diving deeper, here is an in-depth explanation of CIDR notation and choosing the right CIDR block.
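To make the notation concrete, here is a minimal sketch using Python's standard `ipaddress` module. It assumes the VPC block is the /16 network that 192.168.0.1 belongs to; the variable name `vpc_block` is only illustrative.

```python
import ipaddress

# strict=False accepts a host address (192.168.0.1) and returns the
# network it belongs to. With the default strict=True, this call would
# raise ValueError because host bits are set.
vpc_block = ipaddress.ip_network("192.168.0.1/16", strict=False)

print(vpc_block)                # the network in CIDR notation
print(vpc_block.netmask)        # the same /16 written as a subnet mask
print(vpc_block.num_addresses)  # number of addresses in the block
print(vpc_block.is_private)     # whether the block is in a private range
```

A /16 leaves 16 bits for hosts, which is why the block holds 2^16 = 65,536 addresses.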

2. Set up Subnet

Subnet set-up page

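Note that the prefix length of a subnet (/24) should be larger than that of its VPC (/16). A larger prefix length means a smaller block of addresses, and the subnet's range has to fit inside the VPC's range. Refer to this AWS documentation to understand why.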
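The containment rule can be checked with a short sketch, again using the standard `ipaddress` module. The subnet range 192.168.1.0/24 is a hypothetical value chosen for illustration, and `fits_inside` is a helper name made up for this example.

```python
import ipaddress

vpc_block = ipaddress.ip_network("192.168.0.0/16")
subnet_block = ipaddress.ip_network("192.168.1.0/24")  # hypothetical subnet range


def fits_inside(subnet, vpc):
    """Return True if every address in `subnet` also lies in `vpc`."""
    return subnet.subnet_of(vpc)  # available in Python 3.7+


print(fits_inside(subnet_block, vpc_block))
print(subnet_block.num_addresses, "addresses in the subnet")
print(vpc_block.num_addresses, "addresses in the VPC")
```

A range such as 10.0.0.0/24 would fail the same check, because it lies outside the VPC's block.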

3. Set up Internet Gateway (IGW)

Attaching an IGW to a VPC

After creating a VPC and subnet, create a new IGW, and make sure you attach that to the VPC you just created.

4. Set up Route Table

Associate the newly created IGW with the route table

Create a new route for 0.0.0.0/0 that targets the IGW you created in the previous step. This allows the VPC to send and receive IPv4 traffic to and from the internet.
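For readers who prefer scripting over the console, steps 3 and 4 might look roughly like this with boto3. This is a sketch, not the method used in the screenshots: the function name `make_subnet_public` is made up, the `ec2` argument is assumed to be a boto3 EC2 client, and the final association call is an assumption that is not spelled out in the steps above (it is unnecessary if you edited the VPC's main route table instead).

```python
def make_subnet_public(ec2, vpc_id, subnet_id):
    """Attach a new internet gateway to a VPC and route 0.0.0.0/0 through it.

    `ec2` is expected to be a boto3 EC2 client, e.g. boto3.client("ec2").
    Returns the IDs of the new internet gateway and route table.
    """
    # Step 3: create the internet gateway and attach it to the VPC.
    igw = ec2.create_internet_gateway()
    igw_id = igw["InternetGateway"]["InternetGatewayId"]
    ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

    # Step 4: create a route table with a default IPv4 route via the gateway.
    route_table = ec2.create_route_table(VpcId=vpc_id)
    rtb_id = route_table["RouteTable"]["RouteTableId"]
    ec2.create_route(
        RouteTableId=rtb_id,
        DestinationCidrBlock="0.0.0.0/0",
        GatewayId=igw_id,
    )

    # Assumption: associate the route table with the subnet so the new
    # route applies to instances launched in that subnet.
    ec2.associate_route_table(RouteTableId=rtb_id, SubnetId=subnet_id)
    return igw_id, rtb_id
```

With credentials configured, this would be called as `make_subnet_public(boto3.client("ec2"), vpc_id, subnet_id)`, where both IDs come from steps 1 and 2.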

5. Create an EMR cluster

Configuring network settings while spinning up a new EMR cluster

When creating an EMR cluster, configure the networking settings to use the VPC and subnet we set up earlier.
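If you launch clusters from code rather than the console, the networking choice boils down to passing the subnet ID (and the key pair name for step 7) in the cluster request. The sketch below builds the keyword arguments for boto3's EMR `run_job_flow` call. The cluster name, release label, instance types, and instance count are all placeholder assumptions, and `emr_cluster_request` is a made-up helper name.

```python
def emr_cluster_request(subnet_id, key_name):
    """Build keyword arguments for a boto3 EMR client's run_job_flow call."""
    return {
        "Name": "ssh-demo-cluster",           # hypothetical cluster name
        "ReleaseLabel": "emr-6.15.0",         # assumption: pick your own release
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumption
            "SlaveInstanceType": "m5.xlarge",   # assumption
            "InstanceCount": 3,                 # assumption
            "KeepJobFlowAliveWhenNoSteps": True,
            # The subnet from step 2 places the cluster in the new VPC.
            "Ec2SubnetId": subnet_id,
            # The EC2 key pair you will later use with `ssh -i`.
            "Ec2KeyName": key_name,
        },
        # Default role names created by EMR; yours may differ.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = emr_cluster_request("subnet-0123456789abcdef0", "my-key-pair")
```

The request would then be submitted with `boto3.client("emr").run_job_flow(**request)`.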

6. Edit Security Groups

Add a new inbound rule

Once the cluster is ready, copy the VPC ID and go to EC2 > Security Groups to find the security groups in that VPC. Add an inbound rule with source 0.0.0.0/0 to allow incoming connections from any IPv4 address (SSH only needs TCP port 22). If you’d like to be more secure, restrict the source to the IP address that you’re connecting from.
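As a rough illustration of what such a rule looks like in code, the helper below builds an SSH-only permission entry in the shape boto3's `authorize_security_group_ingress` expects. The function name `ssh_ingress_rule` is made up for this sketch, and 203.0.113.7 is a placeholder address from a range reserved for documentation.

```python
import ipaddress


def ssh_ingress_rule(source_cidr="0.0.0.0/0"):
    """Build an IpPermissions entry that allows inbound SSH (TCP port 22)."""
    # Fail early on a malformed CIDR; ip_network raises ValueError.
    ipaddress.ip_network(source_cidr)
    return {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": source_cidr, "Description": "SSH access"}],
    }


open_rule = ssh_ingress_rule()                    # any IPv4 address
my_ip_rule = ssh_ingress_rule("203.0.113.7/32")   # a single address only
```

Either rule could be applied with `ec2.authorize_security_group_ingress(GroupId=group_id, IpPermissions=[rule])`, where `group_id` is the security group of the master node. The `/32` variant is the safer choice because it admits exactly one address.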

7. Connect to EMR using ssh

Yay, finally!

You can use an existing or new EC2 key pair to ssh into EMR. Congrats, you’re in!
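If the connection still hangs, it can help to check whether port 22 is reachable at all before blaming the key pair. The sketch below does that with the standard `socket` module and also assembles the ssh command. Both function names are made up for this example, and `hadoop` is assumed as the login user because it is the default user on EMR nodes.

```python
import socket


def port_is_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS failures.
        return False


def ssh_command(key_path, host, user="hadoop"):
    """Return the ssh invocation as an argument list for subprocess.run."""
    return ["ssh", "-i", key_path, f"{user}@{host}"]
```

A `False` result from `port_is_open` on the master node's public DNS name usually points back to steps 3, 4, or 6 (gateway, route, or security group) rather than to the key file.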

8. Set up SSM (Optional)

Once you have successfully established an ssh connection, I highly recommend using AWS Systems Manager Session Manager (SSM) if possible, to get rid of the networking headaches that we endured earlier. But I’ll leave that for another time.
