AWS Virtual machine tutorial

Willem de Beijer and Daan Kolkman

This tutorial will take you through the steps for creating a data science virtual machine on Amazon Web Services. It is part of our Cloud Computing for Data Science series. A few notes before we get started with AWS:

  • Some AWS components are blocked by Safari’s privacy features, such as cross-site tracking prevention and content blocking. If you’re on Safari, it is easiest to just use Chrome instead.

1. Creating an AWS account

Regular

The easiest way to set up an AWS account is to go to https://portal.aws.amazon.com/billing/signup. This gives you full access to the free tier of all services, but you will need to provide your credit card details before you can get started. You will automatically be billed once you exceed the free limits. 

Student

If you do not own a credit card and are a higher education student, you can create an AWS Educate account at https://aws.amazon.com/education/awseducate/. You will need to sign up with an .edu email address for this to be accepted though. While a student account is mainly meant for teaching purposes, you can also get limited access to the regular AWS services.

From the AWS Educate panel, click on “Use an AWS Educate Starter Account” on the right. Please note that the starter account is hosted by the third party Vocareum instead of Amazon itself.

Once your AWS Starter account is created, you will be taken to the Vocareum dashboard. A Starter account is valid for one year and comes with $30 free credits to spend on AWS. Clicking the orange “AWS Console” button will take you to the regular AWS dashboard. 

Student account limitations

A student account through Vocareum does not have root access like a normal account. Some services might not be available for you, and for some you might have to do a little more work to set up the required IAM accounts.

2. Setting up the VM

At this point we will assume that you are signed in to AWS and are looking at the dashboard as shown below. The process described below is identical for regular and student accounts.

Search for “EC2” in the search bar. Then in the EC2 dashboard click on the “Launch Instance” button. You will now have the option to choose from the many software configurations for your VM, as shown below.

At this point, you have two options. You can either configure your VM from scratch or use an AMI (Amazon Machine Image) that comes with the data science libraries installed out of the box. If you want to learn how to configure your own VM, scroll down to option B. Otherwise keep reading under option A.


A. Setting up a pre-configured VM

Besides blank images with just an operating system, we have the possibility to choose from images that come pre-configured with software such as Anaconda. This speeds up the setup process and is generally an easier approach. The BayesForge AMI is particularly useful for data science and can be found at https://aws.amazon.com/marketplace/pp/B06Y6BNHD3?qid=1563795949837&sr=0-6&ref_=srh_res_product_title. However, student accounts can only choose from a limited selection and this unfortunately does not include the BayesForge AMI. This tutorial will stick with the “Deep Learning AMI (Amazon Linux) Version 23.1” that can be found by searching for “deep learning” in the top bar. (Note: pick the Amazon Linux version, NOT the Ubuntu version)

B. Setting up the VM from scratch

For this option, we will stick with the “Amazon Linux AMI 2018.03.0 (HVM), SSD Volume Type” because it has better support than the Amazon Linux 2 option. Click “Select” on the right and you will be taken to the instance configuration page.

A + B. Finish the configuration

The default “t2.micro” option will be used in this tutorial since it is free and sufficient for demonstration purposes. However, if you require a more capable machine you have a lot of options to choose from. The rest of the setup is the same regardless of which machine you choose. Please note that we will store our datasets in a separate AWS service (S3, see section 4), so the “Instance Storage” column is not that relevant here.

Keep the next few configuration settings as default and keep clicking “Next” until you find “Step 6: Configure Security Group”. 

Now click “Add Rule” to add a Custom TCP rule with port 8888 and source 0.0.0.0/0. You can ignore the security warning for now; we will make sure that no one else can access our VM later.
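For readers who would rather script this step than click through the console, the same rule can be expressed with boto3, the AWS SDK for Python. This is only a sketch and not part of the original walkthrough: it assumes boto3 is installed, that your credentials are allowed to modify security groups (which may not hold for student accounts), and that the security group already exists. The function names and the group ID are hypothetical placeholders.

```python
def jupyter_ingress_rule(port=8888, cidr="0.0.0.0/0"):
    """Describe a Custom TCP rule in the shape boto3 expects: one port, one source range."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr}],
    }


def add_jupyter_rule(ec2_client, group_id, port=8888, cidr="0.0.0.0/0"):
    """Add the rule to an existing security group.

    ec2_client is expected to be boto3.client("ec2"); group_id is a
    placeholder for the ID of your own security group.
    """
    return ec2_client.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[jupyter_ingress_rule(port, cidr)],
    )
```

With real credentials this might be called as add_jupyter_rule(boto3.client("ec2"), "<your-security-group-id>").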

Click the “Review and Launch” button and then in the review screen click “Launch”. You will be shown the following prompt.

In the first dropdown select “Create a new keypair” and then give your key an easy to remember name. Then click “Download Key Pair” and “Launch Instances” afterward. It might take a few minutes before your instance is live.

Congratulations, your Virtual Machine is now up and running!

Connecting to the VM

Mac/Linux

Now that our virtual machine is working, we want to install Anaconda on it and start doing useful things. We will need the .pem file you just downloaded to access the VM, and it’s easiest to do so if you move this to your home directory. 

Now open up a Terminal window and execute the following command:

chmod 400 securitykey.pem

(Note: Make sure you’re in the directory of your .pem file and the name matches your pem file name)

Then in the AWS console, in the description of your instance, you will find a “Public DNS (IPv4)”. Use this address in the following terminal command (note that “ec2-user@” has to be added before the address!):

ssh -i "securitykey.pem" ec2-user@ec2-54-145-61-4.compute-1.amazonaws.com

In case you get asked if you want to continue type “yes”.

Windows

The easiest way to manage SSH connections on Windows is with a tool called Putty. This tool can be downloaded at https://www.chiark.greenend.org.uk/~sgtatham/putty/. Once you’ve finished installing, open the app called “Puttygen”.

Click “Load” and find the .pem key-file you generated on AWS. Note that you will need to select “All Files (*.*)” to be able to see the AWS key.

Click “Save private key”. Putty gives you the option to protect your key with a passphrase, but for this tutorial this is not required. Save the key in an easy to find location. You have now converted the .pem file to a format that Putty can read.

Now start the Putty app. In the category pane choose “Session”. In the Host Name textfield enter the following (replacing the Public DNS with that of your own instance, as shown in the AWS EC2 panel):

ec2-user@ec2-184-73-146-172.compute-1.amazonaws.com

Ensure that the port is set to “22” and the connection type to “SSH”. 

In the left pane go to “Connection” -> “SSH” -> “Auth” and click “Browse”. Now select the private key that we just saved.

(Optional) To make life easier you can save the current configuration for later use. Go to “Session” in the left pane, enter a name under “Saved Sessions” and click the “Save” button.

Now at the bottom of the screen click “Open”. Putty will show a warning since this is the first time you are connecting to this VM; click “Yes” to continue. You will now be shown a terminal similar to the one below:

Setting up Anaconda (B only)

If you chose to use a vanilla VM and install all software yourself, it’s now time to set up Anaconda.

Now download Anaconda with:

wget https://repo.anaconda.com/archive/Anaconda3-2019.03-Linux-x86_64.sh

If you’re using this tutorial a long time after it was written, you might want to get the latest Anaconda download link (but remember to get the Linux version!).

Install Anaconda with:

bash Anaconda3-2019.03-Linux-x86_64.sh

When shown the terms of agreement, hold down the ENTER key to scroll down. Continue to install in the default location.

At the end of the installation you will be asked if you want to prepend your Anaconda installation to the .bashrc PATH. When this shows up type “yes”.

If you accidentally pressed ENTER without typing “yes”, correct this with the following steps. If not, skip ahead to the “source .bashrc” step below.

vim .bashrc

The terminal should now show the vim editor. Press the “i” key to be able to type and add the following to the bottom of the file:

export PATH="/home/ec2-user/anaconda3/bin:$PATH"

Hit the ESC key to get out of edit mode, then type “:wq” and press ENTER to go back to the regular EC2 command line.

Now execute the following command to reload .bashrc, which makes Anaconda’s Python 3 the default:

source .bashrc

Check what version of Python the system is running with the following command:

python

(Should be version 3.x.x)

SSH configuration (A + B)

Mac/Linux

Our Virtual Machine is up and running and has Anaconda installed. Now it’s time to do something useful. Using an SSH connection, we can control our VM and make it run notebooks. To avoid having to enter the details about the identity file and hostname every time we want to connect, we can add these to the SSH settings.

For this we need to start up a fresh Terminal window in the Home directory (this is the default directory for a new Terminal window). Note: keep the other Terminal window open as we will still need it!

In your new Terminal window, execute:

vim .ssh/config

This will open an SSH configuration file in which we can enter our configuration details. Press the “i” key to start typing and paste the following text:

Host ec2
   Hostname ec2-54-145-61-4.compute-1.amazonaws.com
   User ec2-user
   IdentityFile ~/securitykey.pem

Make sure the address after “Hostname” matches the Public DNS of your EC2 instance and that the filename after “IdentityFile” matches the name and location of your .pem file.

Again press ESC to stop editing and “:wq” to save and exit vim mode.
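If you would rather not edit the file by hand in vim, the same entry can be generated with a few lines of Python on your local machine. This is a minimal sketch under the assumption that the alias, hostname, user and key location are the ones used above; the function names are made up for illustration and you should substitute the Public DNS of your own instance.

```python
from pathlib import Path


def ssh_config_entry(alias, hostname, user, identity_file):
    """Return an SSH config block in the same layout as the one shown above."""
    lines = [
        f"Host {alias}",
        f"   Hostname {hostname}",
        f"   User {user}",
        f"   IdentityFile {identity_file}",
    ]
    return "\n".join(lines) + "\n"


def append_ssh_config(entry, config_path=None):
    """Append the entry to ~/.ssh/config (or to another path if one is given)."""
    if config_path is None:
        config_path = Path.home() / ".ssh" / "config"
    # Create the .ssh directory if this is the first SSH configuration.
    config_path.parent.mkdir(parents=True, exist_ok=True)
    with config_path.open("a") as handle:
        handle.write("\n" + entry)


entry = ssh_config_entry(
    alias="ec2",
    hostname="ec2-54-145-61-4.compute-1.amazonaws.com",  # replace with your own Public DNS
    user="ec2-user",
    identity_file="~/securitykey.pem",
)
print(entry)
```

Calling append_ssh_config(entry) would then write the block to your configuration file. The sketch only prints it, so nothing is changed until you decide to.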

Windows

No steps required. On Windows, Putty can store the connection details in a saved session, as described above.

3. Accessing your VM

Congratulations, your Virtual Machine is fully ready for use!

Mac/Linux

To use it for a project, we will first have to start Jupyter Notebooks on our VM. If you still have the EC2 terminal open you can use that one. If not, you can open a new Terminal and SSH into your VM in the same way as we did when setting up Anaconda.

If your Terminal is still in Python mode (which you can recognize by the new line starting with “>>>” instead of your location), you can use the exit command to quit Python mode:

exit()

To start a Jupyter Notebook execute:

jupyter notebook --no-browser

Jupyter Notebooks is now running! Note that we have to keep this terminal open for as long as we want to keep Jupyter running.

Open up a new Terminal window on your local machine and execute the following command to connect to Jupyter Notebooks on our VM:

ssh -NfL 9999:localhost:8888 ec2

Note that the 9999 is an arbitrary port on our local machine, and we could change it to any other available port. The 8888 is the port on our VM.
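Since the local port is arbitrary, it can help to check that the one you picked is actually free before opening the tunnel. The following sketch, run on your local machine, tries to bind the port and reports whether that worked. The function name is hypothetical and the check only covers the IPv4 loopback address, so treat it as a quick sanity check rather than a guarantee.

```python
import socket


def local_port_is_free(port, host="127.0.0.1"):
    """Return True if no other process is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            # Binding fails with OSError when the port is already taken.
            probe.bind((host, port))
        except OSError:
            return False
    return True


for candidate in (9999, 8888):
    state = "free" if local_port_is_free(candidate) else "already in use"
    print(f"Local port {candidate} is {state}")
```

If the port you wanted is reported as already in use, pick another number for the left-hand side of the ssh -NfL command and use that same number in the browser.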

Go to your browser and enter:

localhost:9999

The page you’re directed to will ask you for a token. This token can be found in the last line of your EC2 Terminal and was printed when you started Jupyter.

Windows

Open up a new Putty window. Either enter the same configuration details as we did when installing Anaconda, or go to “Session” in the left pane, click on your saved configuration and click “Load”.

In the left pane go to “Connection” -> “SSH” -> “Tunnels”. In the source port textfield enter “8888” and in the destination textfield enter “localhost:8888”. Now click “Add”. If port 8888 is already in use by another process on your local machine, enter a different source port (for example 9999) and use that port in the browser instead.

Start Jupyter Notebooks in the new terminal by executing:

jupyter notebook --no-browser

Jupyter Notebooks is now running! Note that we have to keep this terminal open for as long as we want to keep Jupyter running.

Go to your browser and enter:

localhost:8888

The page you’re directed to will ask you for a token. This token can be found in the last line of your Putty terminal and was printed when you started Jupyter.

Enter this token and continue. You should now be looking at the familiar Jupyter interface like this:

You can now use Jupyter Notebooks like you normally would, except that everything that you do will be running on your VM! Please note that since you are now working on a new machine, you might have to re-install some packages that you’re used to on your own local device. If you’re using the deep learning AMI, there will be some packages pre-installed. You’ll also get some tutorial and example notebooks by default:

4. Storage using S3

As a data scientist you’ll probably want to work with your own datasets on your VM, so how do you do that? Saving data on the VM itself might not be your best option: data on instance storage is lost when the instance is stopped, and by default the root volume is deleted when the instance is terminated. Luckily AWS has our back here, since it’s super easy to tie together multiple services.

Go back to the main AWS console and search for S3. Your screen should look like this:

Click “Create bucket” to create a new bucket. A bucket is somewhat like a data lake or a folder on your computer. You put data of any format in there and then access it from your VM. The process of setting up a bucket is relatively straightforward, and for now you can just enter the name and leave everything else as default.

Click on your new bucket to open it and you will get an overview of what’s in there. You can also upload a dataset from your local machine to your S3 bucket through this interface. For this tutorial I uploaded the MNIST dataset in CSV format:

Data access authorization

The dataset can only be accessed by authorized users, and therefore we must provide some sort of authorization for our VM to access the data. In AWS this is done through a service called IAM. Go back to the main AWS console and search for “IAM”. The result should look somewhat like this:

Note that an AWS student account is limited in IAM capabilities. Normally we could create an access key and use that in our notebook to prove our authorization, but that function is blocked by Vocareum. We can however create a specialized role with certain permissions and provide our VM with that role. Go to “Roles” in the left menu and click “Create Role”.

Select “EC2” on top and continue by clicking “Next: Permissions”. Search the policies for “S3” and select the option “AmazonS3FullAccess”.

Continue by clicking “Next: Tags”. Tags can be useful to keep track of what resources are used for what projects but since we’re not interested in that right now, click “Next: Review”. Give your role an appropriate name and finish by clicking “Create role”.

Go back to the EC2 console where we created the VM. Select your EC2 instance and in the “Actions” dropdown go to “Instance Settings” and choose “Attach/Replace IAM Role”.

Now select your newly created IAM role and click “Apply”.

Importing and using the data

The data is now ready and we are authorized to use it in a Jupyter Notebook on our VM. To import the data we require some new packages, which can be installed in the Notebook that is running on the VM by executing:

! pip install boto3
! pip install s3fs

It turns out that the MNIST dataset was too large for the memory of a t2.micro instance so I also uploaded a .txt file to show the principle.

The data in our S3 bucket can be accessed relatively easily with pandas by importing boto3 and then opening the path “s3://<bucket-name>/<file-name>” with pandas as shown below.
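A minimal sketch of what such a notebook cell could look like is given here. It assumes that pandas and s3fs are installed on the VM, that the IAM role from the previous section is attached to the instance so credentials should be picked up automatically, and that the file is CSV-formatted. The function names, bucket name and file name are placeholders, not the ones used by the authors.

```python
def s3_path(bucket_name, file_name):
    """Build the "s3://<bucket-name>/<file-name>" path described above."""
    return f"s3://{bucket_name}/{file_name}"


def load_csv_from_s3(bucket_name, file_name, **read_csv_options):
    """Read a CSV-formatted file from S3 into a pandas DataFrame."""
    # Imported here so that the path helper works even without pandas installed.
    import pandas as pd

    # pandas hands "s3://" paths over to s3fs, which is why s3fs was installed above.
    return pd.read_csv(s3_path(bucket_name, file_name), **read_csv_options)


print(s3_path("<bucket-name>", "<file-name>"))
```

In the notebook you might then run df = load_csv_from_s3("<bucket-name>", "<file-name>") followed by df.head() to inspect the first rows. For a large file on a small instance, passing nrows=1000 through read_csv_options limits how much is loaded into memory.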

5. Final note

It might be tempting to just keep your VM running forever, but AWS will charge you for every second it’s online. You can shut it down in the EC2 console on AWS by right-clicking your instance and then selecting “Stop” under “Instance State”. If you want to restart it later, simply do the same but select “Start” instead.
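Stopping and starting can also be scripted with boto3 instead of using the console. The sketch below is an illustration under several assumptions: boto3 is installed, your credentials are allowed to stop and start instances (which may not be the case for student accounts), and the instance ID is a placeholder for your own. The helper names are hypothetical.

```python
def stop_instance(ec2_client, instance_id):
    """Ask EC2 to stop the instance; ec2_client is expected to be boto3.client("ec2")."""
    return ec2_client.stop_instances(InstanceIds=[instance_id])


def start_instance(ec2_client, instance_id):
    """Ask EC2 to start a previously stopped instance."""
    return ec2_client.start_instances(InstanceIds=[instance_id])
```

With real credentials this might be used as stop_instance(boto3.client("ec2"), "<your-instance-id>") at the end of a working session.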

When the VM is restarted, its public address changes, so the Hostname in the .ssh/config file has to be updated with vim as shown earlier in this tutorial. If this bothers you, you can create a fixed IP for your VM by following the bonus section of this tutorial.

6. Bonus: Setting up an elastic IP

To save yourself the hassle of having to reconfigure the IP address every time you restart your VM, you can use a service called Elastic IP. This will give you a fixed IP address that you can use for any AWS instance. Since AWS has a limited number of IP addresses available and wants to avoid abuse, you will be charged $0.005 per hour (which is $3.60 per month) for an elastic IP address that is not connected to a running EC2 instance. You will not be charged anything as long as the connected EC2 instance is running. This small price might be well worth the benefits if you regularly use your VM.
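The monthly figure follows directly from the hourly rate. As a quick check, assuming a 30-day month and the rate quoted above (current AWS pricing may differ, so verify it before relying on the number):

```python
HOURLY_RATE_USD = 0.005  # rate quoted in this tutorial for an idle elastic IP
HOURS_PER_DAY = 24


def idle_elastic_ip_cost(days, hourly_rate=HOURLY_RATE_USD):
    """Cost in USD of keeping an unattached elastic IP for the given number of days."""
    return round(days * HOURS_PER_DAY * hourly_rate, 2)


print(f"30 days idle: ${idle_elastic_ip_cost(30):.2f}")
```

The same function can be used to estimate shorter idle periods, for example idle_elastic_ip_cost(7) for a week in which the VM stays switched off.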

To set it up, go to the AWS EC2 console and navigate to “Elastic IPs” in the left pane. Click “Allocate new address”.

Click “Allocate” in the next window, leaving the settings at the default.

On the top of the EC2 console, click “Actions” -> “Associate address” with your IP address selected.

In the “Instance” dropdown choose your existing EC2 instance. Then click the blue “Associate” button on the bottom of the form.

The public IP address of your EC2 instance is now changed and visible in the “Instances” overview. Don’t forget to change the IP address in your SSH configuration as was done earlier in this tutorial.

Sources:

https://medium.com/@alexjsanchez/python-3-notebooks-on-aws-ec2-in-15-mostly-easy-steps-2ec5e662c6c6


Frits de Raad

2 months ago

This afternoon and evening I tried to configure the VM for AWS following the tutorial, but I eventually got stuck. Creating the AWS account went fine, and downloading Putty, generating the key and setting up the connection also work. At the end of section 2 there is an instruction ‘SSH configuration (A + B)’, which is only filled in for Mac, not for Windows. Perhaps that is why step 3 did not work for me. Creating the SSH tunnels still seems to succeed, but when I then run the command jupyter notebook --no-browser in the terminal, it does start but then seems to hang, see screenshot: (note: I cannot paste or upload it).

When I then go to my browser (I tried Chrome and Firefox) and type localhost:8888 or http://localhost:8888, I get the error message HTTP Error 404 The requested resource is not found.

A few other remarks: at the end of section 2 you are asked to test whether the command ‘Python’ works. It does not; from Python 3.? onward it has to be written with a lowercase ‘p’, and then it works.
At first the notebook command did not work at all; I had to run conda install jupyter in the env to get it going. Then I got another error message about a kernel or something similar, for which I entered a pip install kernel-xxxxx command. As you can see in the screenshot it still reports something about kernels, but I do not know whether that has anything to do with it.

I work on a laptop with Windows 7 Pro.

Hopefully this feedback is useful to you.

Willem de Beijer

2 months ago

Hi Frits,

As we recently discussed in person, the problem was that port 8888 was already being used by another process on your machine. Changing the port number turned out to be the solution.

I am posting it here because others may run into the same problem 😉

Kind regards,
Willem

Contact

  • Sint-Janssingel 92
    's-Hertogenbosch
  • info@jadsmkbdatalab.nl

About us

The JADS MKB Datalab makes data science accessible to everyone. Together with students, we carry out short-term projects to help organisations get value from their data.

Copyright © 2017 All Rights Reserved.