Revisiting the home data center architecture

If all goes well, I will be adding one or two extremely powerful new servers in the coming months.

Those servers use 2.5″ disks, so the only question is how to implement a large-scale storage system. I have an old E6600-based server which would be perfectly fine if its two 1Gbit connections were trunked together into a 2Gbit iSCSI connection.
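For the record, this is roughly what such a trunk looks like on the Linux side with iproute2; the interface names (eth0, eth1) and the address are assumptions, and 802.3ad needs a matching LACP configuration on the switch:

# bond two 1Gbit NICs into one logical link (assumed names eth0/eth1)
ip link add bond0 type bond mode 802.3ad
ip link set eth0 down && ip link set eth0 master bond0
ip link set eth1 down && ip link set eth1 master bond0
ip link set bond0 up
ip addr add 192.0.2.10/24 dev bond0    # example address on the iSCSI network

Worth noting: a single iSCSI TCP session still hashes onto one physical link, so the full 2Gbit only shows up with multiple sessions or multipath.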

2TB in the 2.5″ form factor seems to be the most cost-effective, and prices for 3TB are beyond economical. So if one server could take four disks, a mirrored configuration would give 2TB of storage, with some faster storage in the form of SSDs left over for L2ARC and SLOG.

The old DL360 G3 would be dedicated to working only as a firewall and traffic shaper, while routing and switching would be moved to dedicated managed gigabit switches.

Also, all servers currently boot from NFS, which has proven to work well but is problematic if that NFS server fails, since such a failure has the potential to either lock up or bring down all the other servers. So NFS would be removed in favor of an SSD-based mirrored ZFS root.

One question mark is my current networking setup, which relies heavily on Linux and would need to be ported to managed switches. It shouldn’t be a problem, though, since it is technically all VLAN-based with some bridges carrying more specific rules; those would need to be addressed somehow.
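In essence the current setup is just VLAN sub-interfaces attached to bridges, something along these lines (interface and VLAN numbers made up for illustration):

# VLAN 10 carried on trunk port eth0, bridged so local rules can hang off it
ip link add link eth0 name eth0.10 type vlan id 10
ip link add br10 type bridge
ip link set eth0.10 master br10
ip link set eth0.10 up
ip link set br10 up

On a managed switch the VLAN part maps directly to tagged ports; it is only the bridge-specific rules that need rethinking.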

Also, something like pfSense could be considered. But if such a system is used for the firewall and router, I would like to move from i386 to a 64-bit architecture, because there have been problems with insufficient memory on the current setup. An HP ProLiant DL380 G5 might suit the purpose perfectly as a low-cost server.

Quad-gigabit PCIe network cards seem to be quite cheap, so with three slots it would act as a 12-port gigabit router. That would enable either the current Linux-based routing scheme or a transition to something like the BSD-based pfSense. BSD has a reputation of being a network-oriented system, and some studies have demonstrated that it performs extremely well as a router.

But one thing to remember with Linux/BSD-based routers is to make absolutely certain that the driver support for the network cards is solid. Otherwise the whole stack will fall apart. Dedicated routing hardware works so well because it was built for exactly one purpose: to be a router and nothing more.

So if the new QEMU/KVM hypervisor would set me back 400 €, disks perhaps 500 €, the router 300 €, one or two additional small switches yet another 200 € and a 1400VA UPS 250 €, then the price tag would be 1 650 €, which isn’t too bad.

That cost would hopefully buy me at least another three years and 2TB of storage, with the possibility of expanding that storage to 14TB by using the router as an FC-based storage node, dropping four gigabit ports to accommodate the FC card.

ZFS: L2ARC and SSD wear leveling

I noticed this in arcstats:

l2_write_bytes                  4    260231528448

In less than 24 hours it has written 240GB into a 20GB partition. That is quite an impact on such a small area of an SSD, but I assume much of it comes from having to move large amounts of data back and forth.

But this is definitely something that must be monitored, because my daily backups could theoretically eat away that SSD quite fast, especially since I am in the process of building a new backup system that would verify large amounts of previous backups every single day.

Also, the hit ratio is extremely poor:

l2_hits                         4    2496
l2_misses                       4    5801535

So it might not even be smart to use L2ARC at all for this pool; that is a hit ratio of roughly 0.04%, so the workload seems more random than ZFS can make use of.
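For reference, these counters come straight from /proc/spl/kstat/zfs/arcstats; a rough awk one-liner like this prints the amount written and the hit ratio:

awk '$1=="l2_hits"||$1=="l2_misses"||$1=="l2_write_bytes" {s[$1]=$3}
     END {printf "L2ARC written: %.1f GiB, hit ratio: %.2f %%\n",
          s["l2_write_bytes"]/2^30,
          100*s["l2_hits"]/(s["l2_hits"]+s["l2_misses"])}' /proc/spl/kstat/zfs/arcstats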

233 Media_Wearout_Indicator 0x0032   000   000   000    Old_age   Always       -       655
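That attribute comes from SMART, so keeping an eye on the wear is simple; assuming the SSD shows up as /dev/sda:

smartctl -A /dev/sda | grep -Ei 'wear|lbas_written'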


ZFS: more efficient use of memory

I am still trying to optimize memory use, because ZFS both benefits from and requires a lot of it.

Yesterday I came across something that sparked the idea of using zram instead of simply growing the ARC. What I mean is that the ARC stores both metadata and actual data; by default metadata gets one fourth of the ARC size, but that limit can be tuned.

So now I have a 12GiB ARC, of which 6GiB is for metadata, and I use lz4-compressed zram as L2ARC because it gives me compression, which is not available for the ARC itself. It introduces a little extra latency, but that latency is worth taking because it is orders of magnitude smaller than disk latency.
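A minimal sketch of the setup, assuming a pool called tank, a single zram device and example sizes; the 12GiB/6GiB split is expressed through the zfs_arc_max and zfs_arc_meta_limit module parameters:

# compressed RAM disk used as an L2ARC device
modprobe zram num_devices=1
echo lz4 > /sys/block/zram0/comp_algorithm
echo 8G  > /sys/block/zram0/disksize
zpool add tank cache /dev/zram0

# ARC sizing in bytes: 12 GiB total, 6 GiB of that for metadata
# (append to an existing /etc/modprobe.d/zfs.conf if one is already there)
options zfs zfs_arc_max=12884901888 zfs_arc_meta_limit=6442450944

Since a cache device is not needed for pool integrity, losing the zram contents on reboot is harmless; the cache simply starts cold.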

I am currently getting 250-300% compression on my L2ARC, so by these numbers I have roughly doubled the memory available for ZFS.

/dev/zram0:     226.16% (4001792 -> 1769384)
/dev/zram1:     226.47% (3964928 -> 1750690)
/dev/zram10:    453.25% (1835008 -> 404851)
/dev/zram11:    456.69% (1875968 -> 410772)
/dev/zram12:    305.71% (1048334336 -> 342913450)
/dev/zram13:    305.71% (1048334336 -> 342910918)
/dev/zram14:    297.65% (1048272896 -> 352177859)
/dev/zram15:    297.65% (1048272896 -> 352181555)
/dev/zram2:     231.42% (3956736 -> 1709736)
/dev/zram3:     223.10% (3985408 -> 1786336)
/dev/zram4:     217.22% (3932160 -> 1810168)
/dev/zram5:     222.63% (3948544 -> 1773556)
/dev/zram6:     225.26% (3948544 -> 1752851)
/dev/zram7:     226.23% (3940352 -> 1741715)
/dev/zram8:     255.88% (271933440 -> 106272410)
/dev/zram9:     249.91% (290062336 -> 116064574)

Devices 0-7 are swap and 8 through 15 are used by three different pools. With a little tuning, the most-used data should occupy a nice portion of this space. 32 gigabytes is a lot of memory when you use it wisely.
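For reference, the listing above can be produced by looping over the zram sysfs counters; older kernels expose orig_data_size and compr_data_size as separate files (newer ones fold them into mm_stat), so this is a sketch for the older layout:

for z in /sys/block/zram*; do
    orig=$(cat "$z/orig_data_size")
    comp=$(cat "$z/compr_data_size")
    [ "$comp" -gt 0 ] && printf "/dev/%s:\t%.2f%% (%s -> %s)\n" \
        "$(basename "$z")" "$(echo "100 * $orig / $comp" | bc -l)" "$orig" "$comp"
done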

Edit: it could be that this setup stalls the machine, so I am testing without it.

ZFS L2ARC

[Image: OCZ RevoDrive 80GB PCIe]

Nice hardware, but sadly nothing special by modern standards: 75 000 IOPS, which a modern SSD can match. [Edit: as mentioned later on, the figures SSD manufacturers provide are misleading, so the question remains: can this device sustain a constant 75 000 IOPS? If so, it is much, much better in that respect.]

See Wikipedia

But it gave me an idea.

[Image: Sonnet Tempo SSD Pro]

This card can house two SSDs. It would leave my two on-board SATA ports completely free, and the server could still have SSDs.

That would be the absolute best solution. It would mean I can have two Western Digital Re 3TB drives together with two SSDs, which would give me double the IOPS compared to a single SSD and hence extreme performance on my pool.

While researching which SSD would be best IOPS-wise, I found out that manufacturers exaggerate their IOPS figures. According to this, the figures they give can only be achieved on a new drive for very short periods of time.

Some manufacturers provide sustained (steady-state) figures measured in some standardized, agreed-upon way, while others don’t.

OCZ gladly gives these figures for their Vector 150 drives, but sadly 12 000 steady-state random write IOPS isn’t that good at all.

And while this is quite an entry-level SSD and there are better ones, that is of little use until manufacturers start providing these steady-state figures.

And one cannot trust reviewers’ tests unless they understand this and run their tests for extended periods of time! Otherwise we may get skewed results.

But the best plan would still be to get as much RAM as possible, then of course as much disk as possible, and finally fast flash storage for anything that spills over from RAM.

And with ZFS and a flash cache I could perhaps even use consumer-grade SATA drives to save money and still have reliability and performance.

Let us calculate for the fun of it.

Western Digital Red 3 TB, 2 pieces for 250 €
Sonnet Tempo SSD Pro, 250 € delivered

Comparing the OCZ Vertex 460 120GB and the OCZ Vector 150 120GB, it seems they use similar technology, since the steady-state random write figures are exactly the same.

Looking at other parameters we go with the Vector 150.

OCZ Vector 150 240GB, 150 €

Which would give 6TB of storage with 240GB of L2ARC for 650 €.

Compare that to the original plan, which was to buy a 3TB enterprise-quality SATA drive and combine it with a 250GB SSD: 180 € for the Re4 3TB and 107 € for a Samsung 840 EVO 250GB, for a total of 287 €.

Now that’s a difference!

Good, bad? That certainly is a very good question. But the 650 euro one would perhaps provide much greater performance.

2.3 times the price for 2.0 times the capacity. The extra 0.3 should be covered by the fact that there would be an empty slot left for another SSD, and by the performance increase.

So I think the 650 euro deal is the better one.

And with 4TB WD Desktop Mainstream drives one would get 8TB of storage for an additional 100 euros.

So 750 euros for 8TB pool with 240GB cache. That’s a shitload of money.

But still, that is only 9,3 euro cents per gigabyte of high-performance, reliable storage. 9,3 cents per gigabyte was considered cheap 10 years ago, and that was for the hard drive alone.

Raw enterprise storage with the Re4 3TB would be 6 cents per gigabyte, so there is a definite margin there.

But comparing these (raw disk vs. a real setup) is quite useless. One can buy raw storage and that’s it, but one can’t then just get this kind of performance out of it.

Upgrade the disks to 5TB, add another 240GB of cache for a total of 480GB (or go even further, towards 1TB), and you have a 10TB pool with up to 1TB of cache.

What would that cost?

WD Red 5TB, 2 pieces for 410 €
OCZ Vector 150 480GB, 2 pieces for 530 €
Sonnet Tempo SSD Pro, 250 € delivered

That would be 1190 € for a 10TB high-performance, high-reliability storage pool.

In terms of enterprise or business money, that is peanuts.

Edit: I made a mistake here and sacrificed reliability for savings; I figured I wouldn’t need faster disks because I have the SSDs as cache, but I still need reliable disks to actually get that reliability.

So the prices go up by perhaps 20%, since reliability still requires Re4 enterprise-level disks for the raw storage. The bonus from this accident is an increase in performance.

So now we would have an 8TB pool with 1TB of cache: 480 € for two Re4 4TB drives, 530 € for two OCZ Vector 150 480GB drives and 250 € for the Sonnet Tempo SSD Pro, for a total of 1260 €.

Which of course is some 70 euros more for 2TB less, but in return you get 20% higher MTBF, the performance increase and two years more warranty.

WD Red

[pdf]http://www.wdc.com/wdproducts/library/SpecSheet/ENG/2879-771442.pdf[/pdf]

WD Re4

[pdf]http://www.wdc.com/wdproducts/library/SpecSheet/ENG/2879-701338.pdf[/pdf]

The Re4 also seems to have an order of magnitude lower non-recoverable read error rate, so go figure.

Surveillance disks I would never touch, because they are made for video streams, which are not too sensitive to bit flips and other errors that can, in the context of video, be considered minor.

ZFS pool

And a logical continuation of the memory notes is to jot down what the ZFS pool will look like.

I have been running one for a few months and I have zero complaints. Absolutely none.

I cannot afford deduplication, but LZ4 gives me roughly 5-15% space savings.
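Compression is a per-dataset property and the saving is easy to check; a sketch assuming a pool called tank:

zfs set compression=lz4 tank    # inherited by child datasets
zfs get compressratio tank      # 1.05x-1.15x would match the 5-15% above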

With 16 GB of RAM I will obviously try deduplication on larger data sets as well (3 TB) and see if there are any slowdowns; most probably there will be.

Or I will simply use dedup for some of my data, say long-term storage, where it doesn’t matter that much since the data won’t move.
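Before turning deduplication on, ZFS can simulate it and show the expected ratio and dedup-table size, which is what decides whether 16 GB of RAM is enough; a sketch assuming a pool called tank and a made-up long-term dataset tank/archive:

zdb -S tank                       # simulated dedup table histogram and ratio
zfs set dedup=on tank/archive     # enable it only where the data sits still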

Ultimately I of course hope for full deduplication of my 3 TB, combined with compression, for at least that 25% space gain. That would then be 3.75 TB, or in the best case near 4 TB.

And going from 3 TB to 4 TB of raw disk space would cost an additional 70 euros or so, so getting the same gain out of CPU power and memory, which cost less than that, is of course a win.

And as written earlier, the Samsung 840 EVO 250GB which I am planning on buying would provide some L2ARC cache and log space as well, to speed up reads: a 500MB/s cache for frequently accessed data, in quantities of tens of gigabytes.
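Splitting one SSD into a small log partition and a larger cache partition is straightforward; a sketch assuming the pool is called tank and the EVO is partitioned as /dev/sdb1 (log) and /dev/sdb2 (cache):

zpool add tank log   /dev/sdb1    # small SLOG partition for synchronous writes
zpool add tank cache /dev/sdb2    # the rest as L2ARC

The log device only helps synchronous writes; it is the cache partition that speeds up repeated reads.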

And when using two physically separate disks, the filesystem will spread the extra copies (of important data) so that they are stored on both disks.
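If this refers to the copies property (ZFS already does something similar automatically for its own metadata), asking for extra copies of a dataset is a one-liner; tank/important is a made-up name:

zfs set copies=2 tank/important    # keep two copies of every block in this dataset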

Then I am hoping to summon the money for a tape drive to take cold backups, stored somewhere else. Then I will be happy with my storage.

Samsung 840 EVO vs. WD VelociRaptor

So, I have been reluctant to go with SSDs because I have felt they are too consumer-ish for me.

But looking at the reviews, that seems foolish, because even these consumer-grade SSDs seem to outperform the VelociRaptor in every possible scenario.

But there is one thing to keep in mind, at least with this particular TLC drive, which is the number of NAND chips.

The number of chips affects the speed as seen on this very illustrative graph:

[Image: Samsung 840 EVO TurboWrite performance graph (source: AnandTech)]

The big drops are where this drive’s “turbo” runs out; the turbo is essentially a faster, SLC-like area used for writes. All writes land first in this fast SLC-like memory, from where they are “slowly” copied to the TLC memory.

But that is not the point.

The point is, as you can see, the 120GB model’s performance, which is absolutely poor. That one would lose in throughput to a VelociRaptor, whereas the 250GB model would overpower it.

And SSDs easily outperform the VelociRaptor not only in throughput but also in latency. From what I have seen, the VR has a latency of some 5-6 milliseconds, whereas these have latencies in the range of microseconds.

Erase latency is higher, at 2-3ms, but still lower than that of any HDD I have seen; maybe some 15K SAS drives excluded. I don’t know.

IO operations also far outperform mechanical disks. So in retrospect it would have been stupid and naive of me to think the VelociRaptor would be the better choice for what is practically an operating-system type of workload (virtual machines): lots of small IO operations.

And even if the throughput were lower on this SSD (it is not), it would still be better, performance-wise, to have more IOPS rather than more throughput.

So it seems I will go with the Western Digital Re4 3TB for bulk storage and backups, and the 250GB SSD for virtual machine images.

In addition, this SSD will provide L2ARC cache for my ZFS. So it is an absolute win-win-win situation. And more L2ARC should mean less RAM needed, or at least I can afford to use more RAM for VMs without seriously sacrificing the underlying ZFS performance.

Goes to show the importance of re-evaluating your own opinions and held beliefs: I lived under the impression that my server would be better with a state-of-the-art 10K HDD when, in fact, a relatively consumer-grade SSD is still the better choice.

TLC will of course suffer from wear, since it loses the ability to tell its voltage levels apart (it has to distinguish eight of them, for three bits per cell) sooner than MLC (two bits) or SLC (one bit) would. But even so it should last for years and years, EVEN with a daily write rate of a hundred gigabytes or so.

Swap on a small portion of a TLC SSD could prove to be a bad idea, but then again, if one’s server swaps so much that it eats away an SSD, that probably isn’t the biggest of one’s problems.

And here’s another nail in the coffin of the WD VelociRaptor:

[Image: PCMark storage benchmark result (source: Guru3D)]

I do not even care what this PCMark run actually tests; the result speaks for itself no matter what it measures.

And when you look at the spec table, you can see that even the MTBF is higher than the VelociRaptor’s. Granted, these figures are usually calculated in a somewhat hazy manner, as far as I have heard.

[Image: block diagram]

And another:

[Image: another benchmark result]

It clearly shows an order-of-magnitude improvement over the old technology, and it would be a shame not to take full advantage of it.