19 Oct 2016

Using AWS

This post serves as a short introduction to using AWS (Amazon Web Services) for data science. It should be used to supplement the official Amazon documentation.

AWS doesn’t give you direct access to the hardware. Instead, it hosts many virtual machines (VMs for short) on a single physical machine, allowing Amazon to fully utilize its hardware and serve several customers at once. Amazon calls these VMs “instances” and serves them through its Elastic Compute Cloud (EC2) service.

Setting up EC2

  1. Get an AWS account

    EC2 has a free tier, but if you go over the free usage limits or require more processing power, you’ll need to pay. If you’re a student, also sign up for the GitHub Student Developer Pack, which offers $75-$150 of free AWS credits.

  2. Create an IAM user

    IAM users are used to separate privileges within an AWS account. It is best practice not to use a single all-powerful user for everything, because if that user is compromised, the attacker gains admin access to the whole account. (If you’re sure you are the only person with access to your account, a single user should be OK.)
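As a rough illustration of limiting privileges, the sketch below builds an IAM policy document that allows only a handful of EC2 actions. The action list is an assumption about what a day-to-day user might need, not a recommendation from the AWS docs; the resulting JSON could be pasted into the IAM console's policy editor.

```python
import json

# Hypothetical set of EC2 actions for a day-to-day user; adjust to your needs.
EC2_ACTIONS = [
    "ec2:DescribeInstances",
    "ec2:RunInstances",
    "ec2:StartInstances",
    "ec2:StopInstances",
    "ec2:TerminateInstances",
]


def build_ec2_policy(actions=EC2_ACTIONS):
    """Return an IAM policy document that allows only the given actions."""
    return {
        "Version": "2012-10-17",  # version string of the IAM policy language
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": "*",
            }
        ],
    }


policy_json = json.dumps(build_ec2_policy(), indent=2)
print(policy_json)
```

A user with only this policy attached cannot manage other users or billing, so a leaked credential does less damage than a leaked admin credential would.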

  3. Create a Key Pair

    SSH is a protocol for secure remote command-line access and file transfer (via SFTP).

    We use SSH to connect to EC2 instances. EC2 instances don’t normally support password authentication, so we use key pairs as an alternative: you keep the private key file on your own machine, and EC2 places the matching public key on the instance.
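The OpenSSH client refuses to use a private key file that other users can read, so after downloading the key it usually needs its permissions tightened. A minimal sketch for Linux or macOS follows; the helper name is made up for illustration.

```python
import os
import stat


def restrict_key_permissions(key_path):
    """Make the private key readable by its owner only (mode 0400)."""
    os.chmod(key_path, stat.S_IRUSR)
    # Return the resulting permission bits so the caller can check them.
    return stat.S_IMODE(os.stat(key_path).st_mode)


# Usage, with a hypothetical key file name:
#   restrict_key_permissions("my-key.pem")
```

This has the same effect as running chmod 400 on the key file from a terminal.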

  4. (Optional) Create a Virtual Private Cloud (VPC)

    If you don’t know what VPC is, you can probably use the default VPC.

  5. Create a security group

    Security groups specify security-related attributes, such as exposed ports, for groups of EC2 instances. For more info, see Amazon EC2 Security Groups for Linux Instances.

    We only need to allow inbound SSH traffic on port 22. If you also plan to serve a web page or IPython notebook from this instance, open the ports for those services as needed.
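To make the inbound rules concrete, here is a small sketch that describes them as plain dictionaries. The dictionary shape follows the IpPermissions argument of boto3's authorize_security_group_ingress call, but no AWS call is made here. The IP address is a placeholder from the documentation address range, and port 8888 is included on the assumption that the notebook server runs on Jupyter's default port.

```python
def ingress_rule(port, cidr):
    """Describe one inbound TCP rule for a single port and CIDR range."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr}],
    }


MY_IP = "203.0.113.4/32"  # placeholder; replace with your own address

rules = [ingress_rule(22, MY_IP)]        # SSH
rules.append(ingress_rule(8888, MY_IP))  # optional: notebook server
```

Restricting the CIDR range to your own address, rather than 0.0.0.0/0, keeps the SSH port closed to the rest of the internet.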

Getting started with EC2

  1. Launch an instance

    Amazon Machine Images (AMIs) are snapshots of virtual machines that have been pre-configured for specific use cases. You can use the HVM edition of the Amazon Linux AMI and install all of the packages you need manually. See the Anaconda installation docs for installing the packages we use.

    When selecting an instance type, pick t2.micro for the free tier. If you need more compute power, you will have to spend money or AWS credits. GPU instances can be useful for data science tasks such as training neural networks, but they are not part of the free tier.

    Make sure you select free tier options if you don’t want to incur charges.
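The same launch can be scripted instead of clicked through in the console. The sketch below only assembles the parameters; the parameter names are those accepted by boto3's run_instances call, while the AMI ID, key name, and security group ID are placeholders you would replace with your own values.

```python
def launch_params(ami_id, key_name, security_group_id, instance_type="t2.micro"):
    """Collect the arguments for launching a single instance."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,  # t2.micro is free tier eligible
        "KeyName": key_name,
        "SecurityGroupIds": [security_group_id],
        "MinCount": 1,
        "MaxCount": 1,
    }


# Placeholder IDs, not real resources.
params = launch_params("ami-xxxxxxxx", "my-key", "sg-xxxxxxxx")

# With boto3 installed and credentials configured, this could be passed as:
#   boto3.client("ec2").run_instances(**params)
```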

  2. Connect to your instance

    On Linux and macOS, the ssh command-line client is already installed. If you are using Windows, you can install PuTTY for SSH.
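For the command-line client, the connection command needs the private key file and the instance's public DNS name. A minimal sketch that assembles it is shown below; ec2-user is the default login on the Amazon Linux AMI, and the key file name and host name are made-up placeholders.

```python
import shlex


def ssh_command(key_path, host, user="ec2-user"):
    """Build the ssh argument list for connecting to an instance."""
    return ["ssh", "-i", key_path, f"{user}@{host}"]


# Placeholder key file and public DNS name.
cmd = ssh_command("my-key.pem", "ec2-203-0-113-4.compute-1.amazonaws.com")
print(shlex.join(cmd))
```

The printed line can be pasted into a terminal, or the list can be handed to subprocess.run to start the session from Python.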

  3. Clean up your instance

    Make sure you stop or terminate your instance when you’re done, or you will incur unexpected charges!
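Cleanup can also be scripted so that nothing is left running by accident. The sketch below assumes an EC2 client object such as the one returned by boto3.client("ec2"); it is passed in as an argument so the function itself has no dependencies. It ignores pagination, which should be fine for an account with only a few instances. Note that it terminates every running instance it finds, so it is only suitable for a personal account.

```python
def terminate_running_instances(ec2_client):
    """Terminate every running instance the client can see; return their IDs."""
    response = ec2_client.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    instance_ids = [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]
    if instance_ids:  # nothing to do when no instances are running
        ec2_client.terminate_instances(InstanceIds=instance_ids)
    return instance_ids


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   terminate_running_instances(boto3.client("ec2"))
```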

