Amazon Glacier for backing up data?

0.007 € per GB per month, and uploads and deletes are free. Only retrieval costs anything.

So 500 GB would be 3.50 € a month for storage that is practically as safe and secure as it gets.

My current backup consumption is less than 60 GB per month, so that would cost me 0.42 €, which is considerably more affordable than buying an external hard drive and storing it somewhere safe from fire, burglary and other unforeseeable hazards.

And because deletes are free, I can upload my backups there, delete all the old ones, and keep the cost minimal.

And since I have local copies of these backups, the ones at Amazon would only be downloaded if something horrible happened. At that point it would really not matter if it cost me 2 € to get them back.

Glacier is designed with the expectation that retrievals are infrequent and unusual, and data will be stored for extended periods of time. You can retrieve up to 5% of your average monthly storage (pro-rated daily) for free each month. If you choose to retrieve more than this amount of data in a month, you are charged a retrieval fee starting at $0.011 per gigabyte. In addition, there is a pro-rated charge of $0.021 per gigabyte for items deleted prior to 90 days.

So the minimum sensible time to store backups is 3 months, which is actually quite a good number. Coincidentally, I store my backups on my local servers for the same 90 days. Because of this it might make sense to modify my backups so that full backups are taken only once every three months, with differentials for the remaining weeks.

I also need to change the backup scheme from full + differential to “incremental differential”, so that each week the data is not referenced against the full backup but against the last differential, which in turn references the one before it, and so forth. This will save a lot of space.
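A rough sketch of the idea with dar (the paths and basenames are just examples): each new backup uses the previous backup as its reference catalogue instead of the original full backup.

# Full backup once per cycle:
dar -c /tmp/dest/0005/hp0 -R / -Q

# The first differential references the full backup:
dar -c /tmp/dest/0005/diff/hp0.diff1 -A /tmp/dest/0005/hp0 -R / -Q

# The next one references the previous differential, not the full backup:
dar -c /tmp/dest/0005/diff/hp0.diff2 -A /tmp/dest/0005/diff/hp0.diff1 -R / -Q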

This is probably not one of Amazon’s tape libraries, but they will nevertheless have similarly impressive devices.

[Image: tape library robot]

What’s the storage capacity of a modern tape? A terabyte each?

Tape libraries can store a truly stupendous amount of data. Spectra makes a range of robotic devices that have 24 drives and space for 1,000 tapes — and these can then be networked together to create a library with a total capacity of 3.6 exabytes (3.7 million terabytes). It is likely that Amazon has built quite a few of these libraries at its data centers in the US, Europe, and Asia.

The tools

Currently looking into https://github.com/basak/glacier-cli as a means of building this backup scheme.

There is also the official Amazon AWS CLI tool available: https://aws.amazon.com/cli/

And another Glacier-specific script: https://github.com/uskudnik/amazon-glacier-cmd-interface

Amazon AWS CLI

# aws glacier list-vaults --account-id -
{
    "VaultList": [
        {
            "SizeInBytes": 35062324157,
            "VaultARN": "arn:aws:glacier:eu-west-1:XXXXXXXXXXXXX:vaults/XXXXXX",
            "LastInventoryDate": "2014-03-29T04:14:56.091Z",
            "VaultName": "XXXXXX",
            "NumberOfArchives": 12,
            "CreationDate": "2014-03-21T08:26:53.193Z"
        }
    ]
}
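For reference, creating a vault and pushing a single file into it looks roughly like this (the vault and file names here are just examples):

# aws glacier create-vault --account-id - --vault-name backup0-hp0-0007
# aws glacier upload-archive --account-id - --vault-name backup0-hp0-0007 --archive-description "hp0.1.dar" --body hp0.1.dar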

Custom scripts

take-backup

This is the new backup script. It works with dar exactly like the previous one, but this one uses the concept of “archives”, which are basically containers holding one full backup and then an indefinite number of incremental differential backups.

Backups are split into 100MB blocks to comply with Amazon’s recommendation to use multipart upload for files larger than 100MB.
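One way to produce those blocks is dar’s own slicing option; a minimal sketch, assuming that is what the script does:

# Create the backup in 100MB slices: hp0.1.dar, hp0.2.dar, ...
dar -c /tmp/dest/0005/hp0 -R / -s 100M -Q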

After about 8 hours of design and Bash scripting, the take-backup script is now pretty much finished:

Request for integrity verification: /tmp/dest/0005/hp0
Testing /tmp/dest/0005/hp0.1.dar..
Request for integrity verification: /tmp/dest/0005/diff/hp0.2015-10-05.05:05:46.1444010746
Testing /tmp/dest/0005/diff/hp0.2015-10-05.05:05:46.1444010746.1.dar..
Starting incremental differential backup: hp0.2015-10-05.05:05:46.1444010746 -> hp0.2015-10-05.05:06:02.1444010762
Furtive read mode has been disabled as dar is not run as root
All done

It can do all sorts of magic and verify the archives. Those are local archives, but the script verifies them and raises errors if they have any problems.
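The verification presumably boils down to dar’s test mode; a minimal sketch of the idea:

# Test all slices of a local archive; dar -t takes the basename without the slice number.
if ! dar -t /tmp/dest/0005/hp0 -Q; then
    echo "Integrity verification failed: /tmp/dest/0005/hp0" >&2
    exit 1
fi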

create-archive

A simple script which can be configured to create a new archive when the maximum lifetime of the current archive has been exceeded. The take-backup script then picks up the most recent archive and uses that.
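A sketch of that logic, with hypothetical paths, a 90-day lifetime and a .created timestamp file marking when each archive directory was started:

dest=/tmp/dest                       # hypothetical destination root
max_lifetime=$(( 90 * 24 * 3600 ))   # 90 days in seconds

# Newest existing archive directory, e.g. /tmp/dest/0005
latest=$(ls -1d "$dest"/[0-9][0-9][0-9][0-9] 2>/dev/null | sort | tail -n 1)

# Create the next numbered directory if there is none yet, or the newest one is too old.
if [ -z "$latest" ] || [ $(( $(date +%s) - $(stat -c %Y "$latest/.created") )) -gt "$max_lifetime" ]; then
    next=$(printf '%04d' $(( 10#$(basename "${latest:-$dest/0000}") + 1 )))
    mkdir -p "$dest/$next"
    touch "$dest/$next/.created"
fi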

upload-archive

Currently working on this script:

# ./upload-archive --archive=/tmp/boot/ --vault=backup0-hp0-0007 --create
Using existing vault: backup0-hp0-0007
Archive saved: /tmp/boot/System.map-2.6.32-431.29.2.el6.i686
Archive saved: /tmp/boot/System.map-2.6.32-431.20.3.el6.i686
Archive saved: /tmp/boot/symvers-2.6.32-431.17.1.el6.i686.gz
Archive saved: /tmp/boot/System.map-3.12.28-lxc.old
Archive saved: /tmp/boot/symvers-2.6.32-431.20.5.el6.i686.gz
Archive saved: /tmp/boot/config-2.6.32-431.23.3.el6.i686
Archive saved: /tmp/boot/config-2.6.32-431.20.5.el6.i686

It is capable of creating vaults and uploading material into them. Glacier is a bit difficult to work with, as you cannot really see what has happened; you must trust that what you have done works the way it is described.

What that means is that there is no way to see your files; they are simply there. The perfect solution would be to verify that the files actually made it, and I might look into that, but for now the script only checks the exit status: if it is 0, it writes another file onto the local filesystem to signify that this file has been uploaded, so that the script can continue where it left off if something goes wrong.
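A minimal sketch of that bookkeeping (the variables are hypothetical; one aws glacier upload-archive call per block):

for block in "$archive_dir"/*.dar; do
    # Skip blocks that already have their marker file.
    [ -e "$block.uploaded" ] && continue
    if aws glacier upload-archive --account-id - --vault-name "$vault" \
           --archive-description "$(basename "$block")" --body "$block"; then
        # Exit status 0: write the marker so a re-run continues where it left off.
        touch "$block.uploaded"
    fi
done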

It is, however, possible to initiate these sorts of jobs, but they take time.

# aws glacier initiate-job --account-id - --vault-name work0 --job-parameters '{"Type": "inventory-retrieval"}'
# aws glacier list-jobs --account-id - --vault-name work0
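Once the job eventually completes, its result can be fetched with the job id that list-jobs reports, for example:

# aws glacier get-job-output --account-id - --vault-name work0 --job-id <job-id> inventory.json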

delete-vault

This script will be able to take a maximum archive age as a parameter and delete all archives that are older. This way it will be possible to pass something like 8294400 (seconds) as an argument to signify that archives older than 96 days should be deleted.
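A sketch of how that could look, assuming a hypothetical local uploads.list file with one “<epoch> <archive-id>” line per uploaded archive (Glacier itself gives no listing without an inventory job):

max_age=$1                           # e.g. 8294400, i.e. 96 days in seconds
cutoff=$(( $(date +%s) - max_age ))

while read -r uploaded_at archive_id; do
    if [ "$uploaded_at" -lt "$cutoff" ]; then
        aws glacier delete-archive --account-id - \
            --vault-name backup0-hp0-0007 --archive-id "$archive_id"
    fi
done < uploads.list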

compress-archive

This is for compressing a given archive.

Other scripts

These are yet to be designed, but something needs to keep track of what’s in the cloud and how to deal with it.

All the scripts work independently and have no real ties to one another, so they can run when they like and perform their tasks as they wish. If something fails, it fails gracefully and should not affect the other scripts too much.

Amazon Glacier API

They have made it either robust or overly complicated, depending on how you want to think about it:

[Image: TreeHash-MPU diagram]

It adds another layer of computation and data processing, which is something I could live without, but it won’t bother me too much either.

If you are using an AWS SDK to program against Amazon Glacier, the tree hash calculation is done for you and you only need to provide the file reference.

I had already started to write Python, but apparently it isn’t necessary.

Well, since I am not using an AWS SDK but Bash, it seems that doesn’t apply to my situation: https://docs.aws.amazon.com/cli/latest/reference/glacier/upload-archive.html

Python seems to have a package available on pip: https://pypi.python.org/pypi/TreeHash/1.0

And even better, it is available as a command, so it can be used in these scripts.
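For the record, the tree hash itself is simple: SHA-256 over every 1 MiB chunk of the file, then adjacent pairs of digests are hashed together repeatedly until a single hash remains. A minimal Bash sketch, assuming openssl and xxd are available:

file="$1"
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT

# 1. Split the file into 1 MiB chunks (longer suffixes to allow many chunks).
split -b 1M -a 4 "$file" "$tmpdir/chunk."

# 2. SHA-256 of every chunk, kept as hex strings.
hashes=()
for chunk in "$tmpdir"/chunk.*; do
    hashes+=("$(openssl dgst -sha256 -r "$chunk" | cut -d' ' -f1)")
done

# 3. Hash adjacent pairs (concatenated as binary digests) until one hash remains.
while [ "${#hashes[@]}" -gt 1 ]; do
    next=()
    for ((i = 0; i < ${#hashes[@]}; i += 2)); do
        if (( i + 1 < ${#hashes[@]} )); then
            next+=("$(printf '%s%s' "${hashes[i]}" "${hashes[i+1]}" | xxd -r -p |
                      openssl dgst -sha256 -r | cut -d' ' -f1)")
        else
            next+=("${hashes[i]}")   # an odd digest is carried up unchanged
        fi
    done
    hashes=("${next[@]}")
done

echo "${hashes[0]}"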

Compression

I am adding another machine to do the compression with xz at the highest compression level, because my main backup machine is not powerful enough to do any serious compression.

http://catchchallenger.first-world.info//wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO

gzip -5 would be fast, but xz -e saves quite a bit of storage space, which saves money.

If you look at the memory consumption of each algorithm you can clearly see when they were designed:

[Image: memory consumption comparison of the compression algorithms]

This may not apply to lz4 and lzop, as those are by design very lean.

I am going to use pxz for parallel LZMA compression: https://jnovy.fedorapeople.org/pxz/

The local side will always store the uncompressed files, because they need to be referenced against, but the highly compressed ones can be uploaded to Amazon.

Serious problems with pxz:

context size per thread: 201330688 B
hp0.13.dar -> 8/11 threads: [5 7 0 6 4 1 2 3 2 1 0 ] 2097152000 -> 3517960820 167.749%
context size per thread: 201330688 B
hp0.14.dar -> 8/11 threads: [4 6 5 1 3 7 0 2 2 1 0 ] 2097152000 -> 5497661344 262.149%

I cannot understand how it could possibly increase the file size this much, even if the data were completely incompressible. If these two are not just odd balls, I need to switch to bzip2 or something else.

There seems to be some sort of bug in pxz, because the file sizes do not match what is being reported:

-rw------- 1 root root 2.0G Oct  6 05:37 hp0.13.dar.xz
-rw------- 1 root root 1.9G Oct  6 05:41 hp0.14.dar.xz
-rw------- 1 root root 2.0G Oct  6 05:45 hp0.15.dar.xz

Perhaps it has something to do with me giving it a list of files to compress instead of launching a new process for each file.

And that seems to be the problem because:

hp0.16.dar -> 8/11 threads: [1 0 6 7 2 5 3 4 2 0 1 ] 2097152000 -> 8050552768 383.880%
-rw------- 1 root root 484M Oct  6 06:02 hp0.16.dar.xz
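So the fix seems to be simply to launch one pxz process per file; a sketch (the flags mirror xz’s, the path is just an example):

for f in /tmp/dest/0005/*.dar; do
    pxz -9 -e "$f"          # one process per file, to avoid the bogus size reporting seen above
done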

Progress

Progress is slow but steady: new versions come out constantly, and modifications to the scripts and the architecture are made, and need to be made, because of the changes.

The scripts may also be a little overengineered, as they verify the whole full backup each time a differential is taken. So if the initial backup was 50 GB, they will calculate SHA-256 over 50 GB of data to make sure it is intact. Sure, that is secure, but it is also extremely heavy.

It also seems that one full backup, compressed, took 42 GB. If we assume a sort of worst-case scenario of 500 MB of daily changes, which under certain conditions isn’t entirely implausible, that would mean 42 GB + (90 × 0.5 GB) of data stored in Glacier. That is 87 GB, and we can round it to 100 GB, so it would be 0.70 € per month for the whole system’s backups.

It also turned out that Amazon Glacier doesn’t support filenames or anything like that, so this information must be stored locally and in the description field of each uploaded archive.

The ideal approach would perhaps be to use something like SQLite to create a per-backup-cycle (3 months) collection of all the filenames and checksums, and upload that too into Glacier. It would consume practically no space, so we could just keep creating them and use the latest one if we ever need to restore any data.
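A sketch of that catalogue using the sqlite3 command-line tool (the database path, table and variable names are hypothetical):

db=/var/backups/catalog-0005.db     # one database per backup cycle

sqlite3 "$db" 'CREATE TABLE IF NOT EXISTS uploads (
    filename    TEXT PRIMARY KEY,
    archive_id  TEXT,
    sha256      TEXT,
    uploaded_at TEXT DEFAULT CURRENT_TIMESTAMP
);'

# Record one upload; the archive id comes from the aws glacier upload-archive response.
sqlite3 "$db" "INSERT OR REPLACE INTO uploads (filename, archive_id, sha256)
               VALUES ('$file', '$archive_id', '$checksum');"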
