Store JSON data in JSON database

Makes sense, doesn't it? There is no sense in storing JSON data in an RDBMS where it cannot be queried as-is. If you want to query it, you either have to unwrap it into an ACID-compatible relational "table hell", which is painful to say the least, or write some horrifically inefficient additional code to interpret the data.

So JSON data should be stored in a JSON database.

Install Apache CouchDB 1.6.1 on RHEL/CentOS/Fedora and Debian/Ubuntu

CouchDB is my choice for now. Sadly no compiled RPM seems to exist, so compilation from source is required. Not a problem, but I would have preferred an RPM to keep everything neat and tidy.
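For reference, a rough sketch of what the source build looks like on RHEL/CentOS. The package names and the download URL are my assumptions for a typical 1.6.1 build; check the CouchDB installation docs for your exact distribution:

# Build dependencies (assumed package names; EPEL may be needed for js-devel)
yum install -y gcc make erlang libicu-devel js-devel libcurl-devel

# Fetch, build and install CouchDB 1.6.1 from the Apache archive
curl -O http://archive.apache.org/dist/couchdb/source/1.6.1/apache-couchdb-1.6.1.tar.gz
tar xzf apache-couchdb-1.6.1.tar.gz
cd apache-couchdb-1.6.1
./configure        # checks for Erlang, SpiderMonkey, ICU and curl
make
make install       # installs under /usr/local by default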

Windows architecture flaw

There is one serious architectural flaw in Windows.

I began installing SQL Server 2014; it told me all dependencies were satisfied and let me choose all the available features.

But then at the end it shows me this:

[Image: snap1212 — the setup failure message]

And in the log file it says

An error occurred for a dependency of the feature causing the setup process for the feature to fail.

So a) it could not tell me beforehand that it cannot meet some of the dependencies, and b) it gives me no indication of what that dependency might be.

Contrast this to any Linux package management system where dependencies are checked before any installation ever begins. And if dependencies are not met then those package management frameworks can add those missing dependencies and install them too, and then install the final product with all dependencies met.

So it seems Windows/Microsoft does not keep any proper track of its dependencies. The system is pretty solid with its next-next-finish principle, but then it cannot deliver it end-to-end and fails on something as trivial as dependencies.

They have the Microsoft Update system, so why can't they use that to satisfy all the dependencies? I have already ticked the one place that allows Windows to contact that system, but apparently SQL Server a) won't do it or b) those dependencies are not available there. In either case it should give me a bit more information than just saying that dependencies were not met.

Because now what it makes me do is seek out myself what those dependencies are. In the Linux world the package management software, and every piece of software in general, tells you exactly which dependencies you are missing.
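For example, on the Linux side that information is one command away (using couchdb here purely as an example package):

rpm -qR couchdb              # what an RPM requires
yum deplist couchdb          # full dependency resolution on RHEL/CentOS/Fedora
apt-cache depends couchdb    # the same idea on Debian/Ubuntu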

SQL Server even gives me something called a System Configuration Check Report, and everything in it is marked either Passed or Not applicable, so why did it still fail? It makes no sense.

It couldn't even install Database Engine Services, which sounds pretty damn important. So what was the point of the initial dependency check if it did practically nothing?

Update

https://stackoverflow.com/questions/27489182/sql-server-2014-installation-fails-an-error-occurred-for-a-dependency-of-the-fe

There seems to be something called Detail.txt there, and it is a 2.44 MB file, so perhaps that will tell me what failed. But it surely is well hidden, and the installation program says nothing about this file.

Another interesting thing is this:

[Image: snap1213]

I only mention it because it took so long. Did it scan the hard drive for updates instead of looking at some index? Or did it verify the integrity of all those updates, and that's why it took so long? It shouldn't take long to fetch the list of installed updates.

Modernization of storage Part II (ownCloud)

So in the previous article I talked about my need to modernize my storage system and move to a cloud-based one, briefly discussed ownCloud, and deemed Pydio to be of poor quality.

Then I wanted to see whether the other alternative, ownCloud, was any better, and to my surprise its code base is much cleaner and has a "deeper" structure.

And because ownCloud provides the same sync feature and offers clients for basically every platform, mobile ones included, it is definitely the better choice.

But more on this hopefully later at some point.

Modernizing personal data storage system

I am looking into moving to some sort of web-based personal storage system, such as https://pydio.com/en, which looks extremely good on the surface; I haven't had a chance to try it yet.

Another one which came up was ownCloud, but to be perfectly honest it doesn't seem as finished and clean as the other one.

One thing all of these must do is use the underlying filesystem and store files in some sensible way. By which I mean that in case anything ever goes wrong, there must be a simple way to migrate the data to another system or to ditch any management system altogether.

In other words, any system worth considering should simply store the files on the filesystem and act as a web-based GUI for it, with some intelligence such as indexing, searching and other things of that nature.

Or if not, then the system must be absolutely foolproof, exactly like any ordinary filesystem pretty much is.

But I think I am sold on Pydio, because if you look at what they say, it seems pretty awesome:

[Image: pydio-features — feature list from the Pydio website]

Just as I am writing this text away from home and storing these images in the Mega.nz cloud so that the files are also available at home, Pydio has this exact same feature, which makes it extremely powerful!

So you can create your own personal storage cloud and use it everywhere, securely.

Pydio also has a demo available at https://demo.pyd.io/ (demo / demo).

But the source, at least for this driver, does not impress me at all:

https://github.com/pydio/pydio-core/blob/develop/core/src/plugins/access.fs/class.fsAccessDriver.php

Check the size of those functions. While it looks just fine on the surface, under the hood it is a mess. And I very much detest using a switch statement like that.

That's a fast way to develop, because it doesn't require any thought about the structure of the code. But the result is very poor and quality suffers.

And after seeing that I am not that trusting towards this system any more.

The same style continues elsewhere: https://github.com/pydio/pydio-core/blob/develop/core/src/index.php

And they have had time to paste the license at the beginning of each file, but not enough to document the code properly. The other question is why they haven't built this on some existing framework instead of rolling their own.

So those are some improvement points.

Studying Microsoft SQL Server 2012 with some great books

New challenges, and these require studying. I started with SQL Server 2012 T-SQL Recipes, 3rd Edition, but because it relies on deeper knowledge of the database itself, I had to switch to Microsoft SQL Server 2012 Internals, a deeper and more technical book about the internal operation of the database engine.

[Image: Untitled]

The Recipes book is from Apress and the Internals one from Microsoft Press, and the latter should be a very in-depth read.

[Image: Untitled2]

This book is intended to be read by anyone who wants a deeper understanding of what SQL Server does behind the scenes. The focus of this book is on the core SQL Server engine—in particular, the query processor and the storage engine.

The one problem for me with Microsoft and Windows-based products is that it is "scary" not to know what happens behind the scenes. In Unix and Linux you can always be aware of what happens, but in the Microsoft world it isn't as straightforward; you simply press buttons and expect good things to happen.

And what about when bad things happen? What are you supposed to do? Surely there are error logs, but the mentality is different. Supposedly their systems are meant to be used as they are, and when that is done they function properly.

And when things fail, they supposedly fail gracefully and fix themselves. But when they don’t — that’s where the problems begin.

Also, some of my personal databases will move to the Microsoft platform, simply because there is a need to use and manage Microsoft systems, so that is the best way to go.

Getting started with Microsoft SQL Server 2012

Free PDF book available for download from here.

Also directly from this link: http://download.microsoft.com/download/F/F/6/FF62CAE0-CE38-4228-9025-FBF729312698/Microsoft_Press_eBook_Introducing_Microsoft_SQL_Server_2012_PDF.pdf

A long, 250-page book covering everything essential.

Microsoft SQL Server 2012 is Microsoft’s first cloud-ready information platform. It gives organizations effective tools to protect, unlock, and scale the power of their data, and it works across a variety of devices and data sources, from desktops, phones, and tablets, to datacenters and both private and public clouds. Our purpose in Introducing Microsoft SQL Server 2012 is to point out both the new and the improved capabilities as they apply to achieving mission-critical confidence, breakthrough insight, and using a cloud on your terms.

As you read this book, we think you will find that there are a lot of exciting enhancements and new capabilities engineered into SQL Server 2012 that allow you to greatly enhance performance and availability at a low total cost of ownership, unlock new insights with pervasive data discovery across the organization and create business solutions fast—on your terms.

But it seems to be specifically a SQL Server 2012 book, so it doesn't go into too many details on how SQL Server generally works.

Lots of data but not really helpful for a beginner.

Amazon Glacier for backing up data?

0.007€ per GB and deletes and uploads are free. Only retrieval will cost.

So 500GB would be 3.50 € a month of practically absolutely safe and secure storage.

My current backup consumption per month is less than 60 GB, so that would cost me 0.42 €. That is considerably more affordable than buying an external hard drive and storing it somewhere safe from fire damage, burglary and other unforeseeable hazards.

And because deletes are free, I can upload my backups there, issue deletes for all the old ones and keep the cost minimal.

And since I have local copies of these backups, those at Amazon would only be downloaded back once something horrible happened. Then it would really not matter if it cost me 2 € to get them back.

Glacier is designed with the expectation that retrievals are infrequent and unusual, and data will be stored for extended periods of time. You can retrieve up to 5% of your average monthly storage (pro-rated daily) for free each month. If you choose to retrieve more than this amount of data in a month, you are charged a retrieval fee starting at $0.011 per gigabyte. Learn more. In addition, there is a pro-rated charge of $0.021 per gigabyte for items deleted prior to 90 days. Learn more.

So the minimum make-sense time to store backups is 3 months, which is actually quite a good number. Coincidentally, I keep my backups on my local servers for the same 90 days. Because of this it might make sense to modify my backups so that full backups are taken only once every three months, with differentials for the remaining weeks.

I also need to change the backup scheme from full + differential to "incremental differential", so that each week's data is not referenced against the full backup but against the last differential, which in turn references the one before it, and so forth. This will save a lot of space.

This is probably not one of Amazon's tape libraries, but they will nevertheless have similarly impressive devices.

[Image: robot — a robotic tape library]

What's the storage capacity of a modern tape? One terabyte per cartridge?

Tape libraries can store a truly stupendous amount of data. Spectra makes a range of robotic devices that have 24 drives and space for 1,000 tapes — and these can then be networked together to create a library with a total capacity of 3.6 exabytes (3.7 million terabytes). It is likely that Amazon has built quite a few of these libraries at its data centers in the US, Europe, and Asia.

The tools

Currently looking into https://github.com/basak/glacier-cli as means for creating this backup scheme.

There is also official Amazon AWS CLI tool available: https://aws.amazon.com/cli/

And also another Glacier specific script: https://github.com/uskudnik/amazon-glacier-cmd-interface

Amazon AWS CLI

# aws glacier list-vaults --account-id -
{
    "VaultList": [
        {
            "SizeInBytes": 35062324157,
            "VaultARN": "arn:aws:glacier:eu-west-1:XXXXXXXXXXXXX:vaults/XXXXXX",
            "LastInventoryDate": "2014-03-29T04:14:56.091Z",
            "VaultName": "XXXXXX",
            "NumberOfArchives": 12,
            "CreationDate": "2014-03-21T08:26:53.193Z"
        }
    ]
}
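For completeness, the basic upload and delete calls with the AWS CLI look roughly like this; the vault name is borrowed from the upload-archive example further down, and the archive id placeholder is hypothetical:

# Upload one slice; Glacier has no filenames, so the name goes into the description
aws glacier upload-archive --account-id - --vault-name backup0-hp0-0007 \
    --archive-description "hp0.1.dar" --body /tmp/dest/0005/hp0.1.dar

# Delete an old archive by the id returned from the upload above
aws glacier delete-archive --account-id - --vault-name backup0-hp0-0007 \
    --archive-id "<archive-id>"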

Custom scripts

take-backup

This is the new backup script. It works with dar exactly like the previous one, but this one uses the concept of "archives", which are basically containers holding one full backup and then an indefinite number of incremental differential backups.

Backups are split into 100 MB blocks to comply with Amazon's recommendation of using multipart upload for files larger than 100 MB.

After about 8 or so hours of design and Bash scripting, the take-backup script is now pretty much finished:

Request for integrity verification: /tmp/dest/0005/hp0
Testing /tmp/dest/0005/hp0.1.dar..
Request for integrity verification: /tmp/dest/0005/diff/hp0.2015-10-05.05:05:46.1444010746
Testing /tmp/dest/0005/diff/hp0.2015-10-05.05:05:46.1444010746.1.dar..
Starting incremental differential backup: hp0.2015-10-05.05:05:46.1444010746 -> hp0.2015-10-05.05:06:02.1444010762
Furtive read mode has been disabled as dar is not run as root
All done

It can do all sorts of magic and verify the archives. Those are local archives, but the script verifies them and raises errors if they have any problems.
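Roughly, the dar calls behind this look something like the following sketch, using the slice names from the log above; the exact options in the real script may differ:

BASE=/tmp/dest/0005/hp0
PREV=/tmp/dest/0005/diff/hp0.2015-10-05.05:05:46.1444010746
NEXT=/tmp/dest/0005/diff/hp0.2015-10-05.05:06:02.1444010762

# Verify the existing archives before building on top of them
dar -t "$BASE"
dar -t "$PREV"

# Incremental differential: reference the previous differential, not the full
# backup, and slice into 100 MB pieces for Glacier-friendly uploads
dar -c "$NEXT" -R / -s 100M -A "$PREV"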

create-archive

A simple script which can be configured to create a new archive when the maximum lifetime of the current archive has been exceeded. The take-backup script then picks up the most recent archive and uses that.

upload-archive

Currently working on this script:

# ./upload-archive --archive=/tmp/boot/ --vault=backup0-hp0-0007 --create
Using existing vault: backup0-hp0-0007
Archive saved: /tmp/boot/System.map-2.6.32-431.29.2.el6.i686
Archive saved: /tmp/boot/System.map-2.6.32-431.20.3.el6.i686
Archive saved: /tmp/boot/symvers-2.6.32-431.17.1.el6.i686.gz
Archive saved: /tmp/boot/System.map-3.12.28-lxc.old
Archive saved: /tmp/boot/symvers-2.6.32-431.20.5.el6.i686.gz
Archive saved: /tmp/boot/config-2.6.32-431.23.3.el6.i686
Archive saved: /tmp/boot/config-2.6.32-431.20.5.el6.i686

It is capable of creating archives and uploading material into them. Glacier is a bit difficult to work with, as you cannot really see what has happened; you must trust that what you have done works the way it has been described.

What that means is that there is no way to see your files; they simply are there. The perfect solution would be to verify that the files have been uploaded, and I might look into that, but for now the script only checks the exit status, and if it is 0 it writes another file onto the filesystem to signify that this file has been uploaded, so that it can resume if something goes wrong.
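A minimal sketch of that marker-file idea; the helper name and the file suffixes are hypothetical, and it also saves the returned JSON so the archive id is not lost:

# Hypothetical helper: upload one slice and record success with a marker file
upload_slice() {
    local slice="$1" vault="$2"
    [ -f "${slice}.uploaded" ] && return 0   # already done on a previous run

    if aws glacier upload-archive --account-id - --vault-name "$vault" \
           --archive-description "$(basename "$slice")" \
           --body "$slice" > "${slice}.receipt"; then
        touch "${slice}.uploaded"            # exit status 0: mark the slice as uploaded
    else
        echo "upload failed: $slice" >&2
        return 1
    fi
}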

It is, however, possible to initiate these sorts of jobs, but they take time.

# aws glacier initiate-job --account-id - --vault-name work0 --job-parameters '{"Type": "inventory-retrieval"}'
# aws glacier list-jobs --account-id - --vault-name work0
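Once the inventory-retrieval job eventually completes, typically hours later, its output can be fetched with a third call, roughly:

# The job id comes from list-jobs above; writes the inventory JSON to a local file
aws glacier get-job-output --account-id - --vault-name work0 \
    --job-id "<job-id-from-list-jobs>" inventory.json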

delete-vault

This script will be able to take a maximum archive age as a parameter and delete all the archives which are older. This way it will be possible to pass something like 8294400 (96 days in seconds) as an argument to signify that archives older than 96 days should be deleted.
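A minimal sketch of that age check, assuming a hypothetical local index file with one "archive-id upload-epoch" line per upload (Glacier itself cannot be listed without an inventory job):

MAX_AGE=${1:-8294400}     # seconds; 8294400 = 96 days
NOW=$(date +%s)

# Hypothetical index written at upload time: <archive-id> <upload-epoch>
while read -r archive_id uploaded_at; do
    if [ $((NOW - uploaded_at)) -gt "$MAX_AGE" ]; then
        aws glacier delete-archive --account-id - \
            --vault-name backup0-hp0-0007 --archive-id "$archive_id"
    fi
done < /var/lib/backup/uploaded-archives.index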

compress-archive

This is for compressing a given archive.

Other scripts

These are yet to be designed but someone needs to keep track of what’s in the cloud and how to deal with it.

All the scripts work independently and have no real ties to one another, so they can run when they like and perform their tasks as they wish. If something fails, it will also fail very gracefully and should not affect the other scripts too much.

Amazon Glacier API

They have made it either robust or overly complicated, depending on how you want to think about it:

[Image: TreeHash-MPU — Glacier tree hash calculation for multipart uploads]

It adds another layer of computation and data processing, which is something I could live without, but it won't bother me too much either.

If you are using an AWS SDK to program against Amazon Glacier, the tree hash calculation is done for you and you only need to provide the file reference.

I had already started to write Python for it, but apparently that isn't necessary.

Well, since I am not using an AWS SDK but Bash, it seems this doesn't apply to my situation: https://docs.aws.amazon.com/cli/latest/reference/glacier/upload-archive.html

Python seems to have a package for this available via pip: https://pypi.python.org/pypi/TreeHash/1.0

And even better, it is available as a command, so it can be used in these scripts.

Compression

I am adding another machine to do the compression with xz at the highest compression level, because my main machine responsible for backups is not powerful enough to do any serious compression.

http://catchchallenger.first-world.info//wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO

Gzip -5 would be fast, but xz -e saves quite a bit of storage space, which saves money.

If you look at the memory consumption of each algorithm you can clearly see when they were designed:

[Image: snap1000 — memory consumption of each compression algorithm]

This may not apply to lz4 and lzop, as those are by design very lean.

I am going to use pxz for parallel lzma compression https://jnovy.fedorapeople.org/pxz/

The local side will always store the uncompressed files, because they need to be referenced, but the highly compressed ones can be uploaded to Amazon.

Serious problems with pxz:

context size per thread: 201330688 B
hp0.13.dar -> 8/11 threads: [5 7 0 6 4 1 2 3 2 1 0 ] 2097152000 -> 3517960820 167.749%
context size per thread: 201330688 B
hp0.14.dar -> 8/11 threads: [4 6 5 1 3 7 0 2 2 1 0 ] 2097152000 -> 5497661344 262.149%

I cannot understand how it can possibly increase the file size this much, even if the input were completely incompressible. If these two are not just odd balls, then I need to switch to bzip2 or something else.

There seems to be some sort of bug in pxz, because the file sizes do not match what is being reported:

-rw------- 1 root root 2.0G Oct  6 05:37 hp0.13.dar.xz
-rw------- 1 root root 1.9G Oct  6 05:41 hp0.14.dar.xz
-rw------- 1 root root 2.0G Oct  6 05:45 hp0.15.dar.xz

Perhaps it has something to do with me giving it a list of files to compress instead of launching a new process for each file.

And that seems to be the problem because:

hp0.16.dar -> 8/11 threads: [1 0 6 7 2 5 3 4 2 0 1 ] 2097152000 -> 8050552768 383.880%
-rw------- 1 root root 484M Oct  6 06:02 hp0.16.dar.xz
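If the inflated numbers really do come from handing pxz a whole list of files, the simple workaround is to launch one process per slice. A minimal sketch, assuming pxz accepts the usual xz flags (-9, -e, and -k to keep the uncompressed local copies the differentials still reference):

for slice in /tmp/dest/0005/*.dar; do
    pxz -9 -e -k "$slice"   # one process per file, so the reported sizes refer to a single slice
done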

Progress

Progress is slow but steady: new versions come out constantly, and modifications to the scripts and the architecture are made, and need to be made, because of those changes.

The scripts may also be a little overengineered, as they verify the whole full backup each time they make a differential. So if the initial backup was 50 GB, they will then calculate SHA-256 over 50 GB of data to make sure it is intact. Sure, that is secure, but it is also extremely heavy.

It also seems that one full backup, compressed, took 42 GB, and if we expect a sort of worst-case scenario of 500 MB of daily changes, which under certain conditions isn't entirely implausible, it would mean 42 GB + (90 × 0.5 GB) of data stored in Glacier. That is 87 GB, which we can round up to 100 GB, so it would be about 0.70 € per month for the whole system's backup.

It also turned out that Amazon Glacier doesn't support filenames or any other metadata, so that information must be stored locally and in the description field of each uploaded file.

The most ideal case would perhaps be to use something like SQLite to create a per-backup-cycle (3 months) collection of all the filenames and checksums, and to upload this into Glacier too. It would consume practically no space, so we could just create them and use the latest one if we ever need to restore any data.
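A rough sketch of such a manifest with the sqlite3 command-line tool; the database path and the table layout here are just placeholders:

DB=/var/lib/backup/manifest-2015Q4.db     # hypothetical per-cycle manifest

sqlite3 "$DB" 'CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, sha256 TEXT, bytes INTEGER);'

for slice in /tmp/dest/0005/*.dar; do
    sum=$(sha256sum "$slice" | awk '{print $1}')
    size=$(stat -c %s "$slice")
    sqlite3 "$DB" "INSERT OR REPLACE INTO files VALUES ('$(basename "$slice")', '$sum', $size);"
done

# The manifest itself is tiny, so it can be uploaded alongside the archives
aws glacier upload-archive --account-id - --vault-name backup0-hp0-0007 \
    --archive-description "$(basename "$DB")" --body "$DB"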

Monitoring Percona Server, nginx and Varnish with Cacti

Two or three dozen different options available for monitoring.

[Image: cacti-graph-set — a set of Cacti graphs]

https://www.percona.com/doc/percona-monitoring-plugins/1.0/cacti/mysql-templates.html#percona-mysql-monitoring-template-for-cacti

https://www.percona.com/doc/percona-monitoring-plugins/1.0/index.html

A very simple setup: import the graph templates, attach them to the monitored devices and configure the script.

The Percona MySQL Server on my router is currently not optimized in any way, so after collecting this data for a day I should clearly see where the bottlenecks are. It is only a 200 MiB database, but the one on the other server hosting all these blogs is much bigger at 730 MiB. So that too will go under monitoring once this setup is verified to work.

Nginx

Percona monitoring scripts also include nginx monitoring, which is done over SSH, so that too is monitored.

[Image: snap955]

https://www.percona.com/doc/percona-monitoring-plugins/1.0/cacti/nginx-templates.html

Varnish

And while I was at it I also added Varnish to this list.

I have two-level Varnish caching:

  • Front-end caches
  • Back-end caches

Of the front-ends, one is Varnish 3 and two are Varnish 4, and the back-end is Varnish 3. This script only supports Varnish 2 and 3, so I am only monitoring my back-end Varnish.

But it should be perfectly possible to add another layer on top of that Python script to make it support Varnish 4, or to make a completely new Cacti template for it. For now I am happy with this setup: it tells me how much traffic hits my back-end cache and how much goes to the real back-end system.

https://github.com/glensc/cacti-template-varnish

varnishstat command is made available to Cacti via xinetd:

service varnishstat
{
    socket_type         = stream
    wait                = no
    user                = nobody
    server              = /usr/bin/varnishstat
    server_args         = -1
    only_from           = 1.1.1.1
    log_on_success      = HOST
    port                = 4000
    protocol            = TCP
    disable             = no
    bind                = 1.2.3.4
    type                = UNLISTED
}
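A quick sanity check from the Cacti host is to connect to that port and see whether the one-shot varnishstat dump comes back:

# Should print the "varnishstat -1" counter dump if xinetd is wired up correctly
nc 1.2.3.4 4000 | head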

Cacti

Also, a shoutout to Cacti for being an extremely powerful and versatile system.

Developing Cacti plugins with Percona Monitoring

https://www.percona.com/doc/percona-monitoring-plugins/1.0/cacti/creating-graphs.html

Qt SQL

http://doc.qt.io/qt-5/qtsql-cachedtable-example.html

It supports SQLite, which is amazing: you can write applications and store all data locally in a simple database with a few simple commands. You just have to love it.

While Qt also of course supports all the standard available databases:

Constant                    Value
QSqlDriver::UnknownDbms     0
QSqlDriver::MSSqlServer     1
QSqlDriver::MySqlServer     2
QSqlDriver::PostgreSQL      3
QSqlDriver::Oracle          4
QSqlDriver::Sybase          5
QSqlDriver::SQLite          6
QSqlDriver::Interbase       7
QSqlDriver::DB2             8

 

http://doc.qt.io/qt-5/qsqldriver.html