
This is going to be very nerdy, and probably not at all interesting for those who haven’t at least heard of Amazon S3, Elastic Compute Cloud (EC2) and rsync. But in case you are curious, here is a brief summary. In addition to selling books, Amazon.com has in the last few years been a pioneer in the area of cloud computing. First they started offering fairly cheap replicated storage “in the cloud” (Amazon S3). From the very beginning, S3 looked like a potentially great option for backing up stuff, at least for the nerds. However, the interface to it was too limited to be practical. Then Amazon started offering computing power in the cloud (EC2). As I've been revisiting my backup strategy recently, I started thinking about the possibility of running rsync on EC2 and backing up to S3 through it.

rsync is a clever UNIX program that makes it possible to very efficiently synchronize files between two computers. Basically, the two computers talk about what each of them has, figure out which files changed and only send those (or even just the changed parts of them). This means that if you've got gigabytes of stuff and change a few files here and there, you can let rsync figure out what has changed and update only that. And now you can also use rsync to back up stuff into Amazon’s storage. But at this point it’s not yet for the faint of heart.

So, the idea is to be able to do the following steps:

  1. Create a machine instance with EC2.
  2. Attach a volume to it.
  3. Mount the volume.
  4. rsync some directories to the volume mounted on the remote machine.
  5. Unmount and detach the volume.
  6. Terminate the machine.
  7. Tell Amazon to create a snapshot into S3 for extra security.

This backup solution ends up costing $0.10 per hour while rsync is running (which will be quite a few hours the first time, but quick after that) + $0.10 per month per gigabyte of allocated space + $0.15 per month per gigabyte of snapshot size. I haven’t gotten my first bill yet, but I expect it to come out to about $3 a month for backing up around 10GB. (If you only back up a gigabyte of data, you should be paying under $1 a month.)
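To make the arithmetic concrete, here is a rough sketch of the estimate as a shell function. The function name is made up for illustration, the rates are simply the ones quoted above (they may well change), and the example assumes the snapshot ends up about as big as the volume and that the instance runs for five hours in a month.

```shell
# Hypothetical helper: estimate the monthly bill in dollars.
# Rates are the ones quoted in this post and are assumptions, not current prices.
#   $1 = allocated volume size in GB
#   $2 = snapshot size in GB
#   $3 = hours the instance runs per month
estimate_monthly_cost() {
    awk -v vol="$1" -v snap="$2" -v hours="$3" 'BEGIN {
        # instance time + allocated EBS space + S3 snapshot space
        printf "%.2f\n", hours * 0.10 + vol * 0.10 + snap * 0.15
    }'
}

# 10GB volume, 10GB snapshot, 5 instance hours per month
estimate_monthly_cost 10 10 5    # prints 3.00
```

That is $0.50 for the instance, $1.00 for the volume and $1.50 for the snapshot.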

While the setup process is a little tedious, once you are done, you should be able to run a backup with a script that looks like this:

### Set up EC2 tools - replace with your paths and keys
export EC2_HOME=~/apps/ec2tools
export EC2_PRIVATE_KEY=~/.ec2/pk-FLKSO8372HSJALPSQJHFKS92387DKADG.pem
export JAVA_HOME=/usr/lib/jvm/java-6-sun/jre/

### Create an instance
export EC2_INSTANCE=`ec2-run-instances ami-179e7a7e -k gsg-keypair -z us-east-1a \
    | tr '\t' '\n' | grep '^i-'`
sleep 20
export EC2_HOST=`ec2-describe-instances | grep $EC2_INSTANCE | tr '\t' '\n' \
    | grep amazonaws.com`

### Attach and mount the volume
export EC2_VOLUME=vol-1134d178
ec2-attach-volume $EC2_VOLUME -i $EC2_INSTANCE -d /dev/sdh
ssh -i id_rsa-gsg-keypair root@$EC2_HOST \
    "mkdir /mnt/data-store && mount /dev/sdh /mnt/data-store"

### Run rsync
rsync -e "ssh -i id_rsa-gsg-keypair" -avz /some/local/directory/ \
    root@$EC2_HOST:/mnt/data-store/directory/ > out.txt

### Clean up
ssh -i id_rsa-gsg-keypair root@$EC2_HOST "umount /mnt/data-store"
ec2-detach-volume $EC2_VOLUME -i $EC2_INSTANCE
ec2-terminate-instances $EC2_INSTANCE
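
One weakness of the script above is that if rsync or ssh fails partway through and the script stops, the instance keeps running at $0.10 per hour. A possible variation, sketched here under the assumption of a POSIX shell, is to move the same three clean-up commands into a function and register it with trap so that it runs whenever the script exits. The function name is made up, and if you use this you would drop the explicit "Clean up" section so the steps don't run twice.

```shell
# Hypothetical clean-up function: the same three commands as the
# "Clean up" section, skipped when the variables were never set.
cleanup() {
    if [ -n "${EC2_HOST:-}" ]; then
        ssh -i id_rsa-gsg-keypair "root@$EC2_HOST" "umount /mnt/data-store"
    fi
    if [ -n "${EC2_INSTANCE:-}" ]; then
        ec2-detach-volume "$EC2_VOLUME" -i "$EC2_INSTANCE"
        ec2-terminate-instances "$EC2_INSTANCE"
    fi
}

# Run cleanup when the script exits, whether it finished or failed.
trap cleanup EXIT
```

Put this near the top of the script, before the instance is created.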

These instructions are provided without any warranty; use at your own risk.

Setup Step 1: Sign Up for the Services

Before you can use Amazon Web Services, you need to sign up for an account at http://aws.amazon.com/.

This will give you two keys that you will need for accessing the services: your “access key” and your “secret access key”. Then sign up for two services: Elastic Compute Cloud (EC2) and S3. As part of signing up for EC2, you will create an X.509 certificate. You will need to save your private and public keys in ~/.ec2. All this is described in more detail in the AWS “Getting Started Guide” under Setting up an account.

Setup Step 2: Install EC2 Tools

The process of setting up the tools is described under Setting up the Tools in the AWS “Getting Started Guide”. Just do what that page says.

Setup Step 3: Create an EBS Volume

Elastic Block Store (EBS) is a storage system that can be mounted on EC2 servers. Unlike S3, EBS storage is not replicated and is somewhat less reliable, though presumably still better than what most hosting companies would offer you (or than what you would expect from an average hard drive). It’s also a little cheaper than S3 ($0.10 per GB per month rather than $0.15). On the other hand, you need to decide upfront how big it’s going to be, and you get charged for what you allocate. While in theory you can mount S3 buckets as directories, having tried a few tools, I found all of them fragile. So, my recommendation is to use EBS for mountable storage and create S3 snapshots for additional reliability. (See below.)

To create a 1GB storage block:

ec2-create-volume -s 1 -z us-east-1a

This will give you back a volume ID, something like “vol-5f34d136”. Save it in an environment variable:

export EC2_VOLUME=vol-5f34d136

(If you later need this id again, you can look it up with ec2-describe-volumes.)
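
If you'd rather not copy the ID by hand, the tr and grep trick used for the instance ID elsewhere in this post could be applied here too. This is only a sketch: the helper name is made up, and it assumes that ec2-create-volume prints tab-separated fields with the volume ID among them, so check the output on your own setup first.

```shell
# Hypothetical helper: read tab-separated text on stdin and print the
# first field that starts with the given prefix (e.g. "vol-" or "i-").
extract_id() {
    tr '\t' '\n' | grep "^$1" | head -n 1
}

# Intended use (assumes the output format described above):
# export EC2_VOLUME=$(ec2-create-volume -s 1 -z us-east-1a | extract_id vol-)
```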

For a larger block (and you will probably need more than a gigabyte eventually), pass a different size (in GB) with the -s parameter. You can make volumes of up to 1 terabyte. Amazon will charge you $0.10 per GB per month, so a 10GB volume will be $1 per month.

The second parameter (“-z us-east-1a”) is the “availability zone” where the storage would be created. We'll have to use the same one later when creating the machine instance.

Once you create the storage you will be charged $0.10 per gigabyte per month for it until you destroy it (with “ec2-delete-volume”).

Setup Step 4: Create an Instance and Format the Volume

This only needs to be done once: we'll create a machine instance so that we could format the volume we just created.

First of all, you'll need to set up a key to log into an EC2 machine. Follow the section called “Generating an SSH Keypair” in Running an Instance. The “Getting Started Guide” tells you how to create an instance of the default machine. You can do what it tells you or create an instance of Ubuntu by using this AMI instead: ami-179e7a7e. In either case, you need to specify the same availability zone as your volume (for instance with “-z us-east-1a”).

Instead of doing all this manually as explained in the Guide, you can use export, tr and grep to save the instance ID and the machine’s host name into environment variables.

First the instance ID:

export EC2_INSTANCE=`$EC2_HOME/bin/ec2-run-instances ami-179e7a7e \
    -k gsg-keypair -z us-east-1a | tr '\t' '\n' | grep '^i-'`

Then the host name, after waiting 20 seconds or so:

export EC2_HOST=`$EC2_HOME/bin/ec2-describe-instances | grep $EC2_INSTANCE \
    | tr '\t' '\n' | grep amazonaws.com`
echo $EC2_HOST

If echo didn’t print anything, wait a little longer and run the last two commands again.
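
The waiting and retrying can also be left to the shell. Below is a minimal sketch, assuming a POSIX shell; both function names are made up for illustration, and the lookup function is just the host name pipeline from above wrapped up so it can be called repeatedly.

```shell
# Hypothetical helper: rerun a command until it prints something, or give up.
# Usage: wait_for_output MAX_TRIES DELAY_SECONDS command [args...]
wait_for_output() {
    max_tries=$1
    delay=$2
    shift 2
    try=1
    while [ "$try" -le "$max_tries" ]; do
        output=$("$@") || true      # a failing command just counts as "no output yet"
        if [ -n "$output" ]; then
            printf '%s\n' "$output"
            return 0
        fi
        sleep "$delay"
        try=$((try + 1))
    done
    return 1                        # nothing was ever printed
}

# The host name pipeline from above, as a function.
lookup_ec2_host() {
    $EC2_HOME/bin/ec2-describe-instances | grep "$EC2_INSTANCE" \
        | tr '\t' '\n' | grep amazonaws.com
}

# Intended use: try up to 12 times, 5 seconds apart.
# export EC2_HOST=$(wait_for_output 12 5 lookup_ec2_host)
```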

Now attach the volume:

ec2-attach-volume $EC2_VOLUME -i $EC2_INSTANCE -d /dev/sdh

Then use SSH to format the drive:

ssh -i id_rsa-gsg-keypair root@$EC2_HOST "mkfs -t ext3 /dev/sdh"

(You shouldn’t be asked for a password at this point. If you are, go back to “Generating an SSH Keypair” in Running an Instance.)

We could now continue using the same instance to do backup, but to make things simpler, let’s terminate it now and make a new one in the next section:

ec2-detach-volume $EC2_VOLUME -i $EC2_INSTANCE
ec2-describe-volumes | grep ATTACHMENT # This should return nothing.
ec2-terminate-instances $EC2_INSTANCE

Actual Backup

To do the actual backup, you'll need to use the following commands. (These commands will need to be run every time you want to do a backup.)

First, create an instance of Ubuntu and attach the volume to it. Nothing new here; we already did this during setup.

export EC2_INSTANCE=`ec2-run-instances ami-179e7a7e -k gsg-keypair \
    -z us-east-1a | tr '\t' '\n' | grep '^i-'`
sleep 20  # give the new instance time to get a host name
export EC2_HOST=`ec2-describe-instances | grep $EC2_INSTANCE \
    | tr '\t' '\n' | grep amazonaws.com`
echo $EC2_HOST
ec2-attach-volume $EC2_VOLUME -i $EC2_INSTANCE -d /dev/sdh

Now mount the volume as “/mnt/data-store” (you can call it anything you want, though):

ssh -i id_rsa-gsg-keypair root@$EC2_HOST \
    "mkdir /mnt/data-store && mount /dev/sdh /mnt/data-store"

Now you can do the actual backup:

rsync -e "ssh -i id_rsa-gsg-keypair" -avz /some/local/directory/ \
    root@$EC2_HOST:/mnt/data-store/directory/ > out.txt

Try a small directory first. Note that at this point rsync will be copying all the files, since the remote server doesn’t have any of them.
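
Another way to be careful is rsync's dry-run option (-n), which lists what would be transferred without copying anything. Here is a small sketch of a wrapper around the command above that accepts extra rsync options; the function name is made up, and it assumes EC2_HOST is set as before.

```shell
# Hypothetical wrapper around the rsync command above.
#   $1 = local directory, $2 = directory on the mounted volume,
#   any further arguments are passed to rsync as extra options.
run_backup() {
    src=$1
    dest=$2
    shift 2
    rsync -e "ssh -i id_rsa-gsg-keypair" -avz "$@" "$src/" "root@$EC2_HOST:$dest/"
}

# Dry run first: shows the file list, copies nothing.
# run_backup /some/local/directory /mnt/data-store/directory -n
# Then the real thing:
# run_backup /some/local/directory /mnt/data-store/directory > out.txt
```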

Then unmount the storage:

ssh -i id_rsa-gsg-keypair root@$EC2_HOST "umount /mnt/data-store"

Note that you have to unmount the volume to avoid corrupting it. So, check that it actually got unmounted properly:

ssh -i id_rsa-gsg-keypair root@$EC2_HOST "ls /mnt/data-store/"

This should return nothing.
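
An empty listing is indirect evidence. A more direct check, sketched below, is to look for the mount point in /proc/mounts on the instance, where each line has the device first and the mount point second. The helper name is made up for illustration.

```shell
# Hypothetical helper: read a mounts table (in /proc/mounts format) on
# stdin and succeed only if the given directory is listed as a mount point.
is_mount_point() {
    awk -v dir="$1" '$2 == dir { found = 1 } END { exit !found }'
}

# Intended use: warn if the volume is still mounted.
# ssh -i id_rsa-gsg-keypair root@$EC2_HOST "cat /proc/mounts" \
#     | is_mount_point /mnt/data-store && echo "still mounted!"
```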

Finally, detach the volume and terminate the instance:

ec2-detach-volume $EC2_VOLUME -i $EC2_INSTANCE
ec2-describe-volumes | grep ATTACHMENT # should return nothing
ec2-terminate-instances $EC2_INSTANCE

Check that you don’t have any instances running:

ec2-describe-instances | grep INSTANCE | grep -v terminated

At this point, your stuff should be backed up on the volume, for which you will be billed $0.10 per gigabyte per month. However, your machine instance (which costs you $0.10 per hour) should be off, and thus not costing you anything. On a good day, when rsync takes less than an hour to run and doesn’t transmit too much data, the whole backup procedure will cost you just a little over a dime.

Creating a Snapshot

Amazon documentation says that EBS storage should be more reliable than your typical hard drive, but it’s not as reliable as S3. So, for extra security, you would want to take a snapshot of your volume and save it in S3:

ec2-create-snapshot vol-5f34d136

This will schedule the snapshot, which you can later use to re-create the volume (see the Elastic Block Store Feature Guide).