This post serves as a short introduction to using AWS (Amazon Web Services) for data science. It should be used to supplement the official Amazon documentation.
AWS doesn’t give you direct access to the hardware. Instead, it hosts many virtual machines (VMs) on a single physical machine, allowing Amazon to fully utilize its hardware and serve several customers at once. Amazon calls these VMs “instances” and serves them through its Elastic Compute Cloud (EC2) service.
EC2 has a free tier, but if you go over the free usage limits or require more processing power, you’ll need to pay. If you’re a student, make sure you also sign up for the GitHub Student Developer Pack; it offers $75-$150 of free AWS credits.
IAM (Identity and Access Management) users are used to separate privileges within an AWS account. It is best practice not to use a single all-powerful user for everything, because if that user is compromised, the attacker gains admin access to the account. (If you’re sure you are the only person with access to your account, a single user should be OK.)
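As an illustration of separating privileges, a limited IAM policy might look like the sketch below. The helper name `ec2_operator_policy` and the chosen set of actions are assumptions made for this example; adjust them to what your user actually needs.

```python
import json


def ec2_operator_policy():
    """Return an IAM policy document (as a dict) for a limited, non-admin user.

    The action list is an assumption for this sketch: enough to list, start,
    and stop instances, but not to manage users or billing.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ec2:DescribeInstances",
                    "ec2:StartInstances",
                    "ec2:StopInstances",
                ],
                "Resource": "*",
            }
        ],
    }


# IAM expects the policy as a JSON string.
policy_json = json.dumps(ec2_operator_policy(), indent=2)
print(policy_json)
```

The resulting JSON can be pasted into the policy editor in the IAM console and attached to a user that you use for day-to-day work.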
SSH (Secure Shell) is a protocol for secure remote command-line access and file transfer (via SFTP).
We use SSH to connect to EC2 instances. EC2 instances don’t normally support password authentication, so we use key pairs as an alternative (more on this in a moment).
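To make the key-pair login concrete, here is a small sketch that assembles the `ssh` command for an instance. The key file name and hostname are placeholders invented for this example, and `ec2-user` is the default login on the Amazon Linux AMI; other AMIs use different user names.

```python
import shlex


def ssh_command(key_path, host, user="ec2-user"):
    """Build the argument list for connecting with a private key file.

    The -i option tells ssh which private key (identity file) to present
    in place of a password.
    """
    return ["ssh", "-i", key_path, f"{user}@{host}"]


# Placeholder values for illustration only.
cmd = ssh_command("my-key.pem", "ec2-203-0-113-10.compute-1.amazonaws.com")
print(shlex.join(cmd))
```

Note that `ssh` refuses private keys that are readable by other users, so restrict the file first (for example with `chmod 400 my-key.pem`).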
(Optional) Create a Virtual Private Cloud (VPC)
If you don’t know what a VPC is, you can probably use the default VPC.
Security groups act as virtual firewalls for groups of EC2 instances: they specify which traffic is allowed, for example which ports are exposed. For more info, see Amazon EC2 Security Groups for Linux Instances.
We only need to allow inbound SSH traffic on port 22. Additionally, if you plan to serve a webpage or an IPython notebook from this instance, you can open the ports for those specific services as needed.
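The rules described above can be sketched as plain data. The helper `ingress_rule` and the address `203.0.113.10/32` are assumptions for illustration; the dictionary keys follow the `IpPermissions` shape used by the EC2 API.

```python
def ingress_rule(port, cidr):
    """Describe one inbound TCP rule for a single port and source range."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr}],
    }


# SSH only, restricted to a single (placeholder) address.
rules = [ingress_rule(22, "203.0.113.10/32")]

# Optionally expose a notebook server; 8888 is the usual default notebook port.
rules.append(ingress_rule(8888, "203.0.113.10/32"))

for rule in rules:
    print(rule["FromPort"], rule["IpRanges"][0]["CidrIp"])
```

If you script your setup with boto3, a list like `rules` can be passed as the `IpPermissions` argument of the EC2 client's `authorize_security_group_ingress` call. Restricting the source range to your own address is a design choice in this sketch; opening a port to `0.0.0.0/0` exposes it to the whole internet.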
Amazon Machine Images (AMIs) are snapshots of virtual machines that have been pre-configured for specific use cases. You can use the HVM edition of the Amazon Linux AMI and install all of the packages you need manually. See the Anaconda installation docs for installing the packages we use.
When selecting an instance type, pick t2.micro for the free tier. If you need more compute power, you will have to spend money or AWS credits. GPU instances can be useful for data science tasks.
Make sure you select free tier options if you don’t want to incur charges.
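If you later want to script the launch instead of clicking through the console, the choices made so far (AMI, instance type, key pair, security group) map onto a small set of parameters. This is a sketch only: `launch_params` is a hypothetical helper, and the `ami-EXAMPLE`, `my-key`, and `sg-EXAMPLE` values are placeholders, not real identifiers.

```python
def launch_params(ami_id, key_name, security_group_id, instance_type="t2.micro"):
    """Collect the settings for launching a single instance.

    The default instance type is t2.micro, the free tier option.
    """
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "KeyName": key_name,
        "SecurityGroupIds": [security_group_id],
        "MinCount": 1,  # launch exactly one instance
        "MaxCount": 1,
    }


# Placeholder identifiers; substitute the values from your own account.
params = launch_params("ami-EXAMPLE", "my-key", "sg-EXAMPLE")
print(params["InstanceType"])
```

With boto3 installed and credentials configured, these keys match the keyword arguments of the EC2 client's `run_instances` call, so the dictionary could be passed as `run_instances(**params)`.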
If you are using Windows, you can install PuTTY for SSH.
Make sure you stop or terminate your instance when you’re done, or you will incur unexpected charges!
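One way to check for forgotten instances is to scan the output of a `describe_instances` call for anything still running. The sketch below works on a hand-written sample in the shape of that response; the function name and the instance IDs are made up for the example.

```python
def running_instance_ids(describe_response):
    """Collect the IDs of instances that are still pending or running."""
    ids = []
    for reservation in describe_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            if instance["State"]["Name"] in ("pending", "running"):
                ids.append(instance["InstanceId"])
    return ids


# Hand-written sample data, not real output.
sample = {
    "Reservations": [
        {
            "Instances": [
                {"InstanceId": "i-EXAMPLE1", "State": {"Name": "running"}},
                {"InstanceId": "i-EXAMPLE2", "State": {"Name": "stopped"}},
            ]
        },
        {
            "Instances": [
                {"InstanceId": "i-EXAMPLE3", "State": {"Name": "terminated"}},
            ]
        },
    ]
}

print(running_instance_ids(sample))
```

With boto3, the real response from the EC2 client's `describe_instances()` can be passed in, and the resulting IDs handed to `stop_instances(InstanceIds=...)` or `terminate_instances(InstanceIds=...)`. A stopped instance no longer accrues compute charges, but its attached storage is still billed until the instance is terminated.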