Lots and lots of useful information and insight in this article from James Hamilton.
Starting off with this simple observation:
Every couple of weeks I get questions along the lines of “should I checksum application files, given that the disk already has error correction?” or “given that TCP/IP has error correction on every communications packet, why do I need to have application level network error detection?” Another frequent question is “non-ECC mother boards are much cheaper – do we really need ECC on memory?” The answer is always yes.
So much of our current IT infrastructure is built on the assumption that every single bit at every single layer is transmitted perfectly, without fail. Anyone who’s done any kind of development can assure you that this is a pretty unlikely scenario. Internally, many components do integrate various error checking schemes, but the data then passes to another layer that can introduce an error. And note that most systems integrate error checking, not error correction. Then we add pervasive virtualisation, which introduces additional code whose design purpose is to lie to the other layers that surround it. As long as it manages to keep its lies straight we’re OK, but it assumes that everything it’s talking to is telling the truth and, more importantly, isn’t making any mistakes.
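To make the checking-versus-correcting distinction concrete, here is a minimal Python sketch of application-level integrity checking: store a SHA-256 digest in a sidecar file when data is written, and re-verify it on read. The helper names and the sidecar convention are purely illustrative, not anything from Hamilton's article.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def write_with_checksum(path: Path, data: bytes) -> None:
    """Write the data and record its digest in a sidecar file next to it."""
    path.write_bytes(data)
    path.with_name(path.name + ".sha256").write_text(sha256_of(path))


def read_verified(path: Path) -> bytes:
    """Re-read the data and detect (not correct) any corruption since it was written."""
    expected = path.with_name(path.name + ".sha256").read_text().strip()
    if sha256_of(path) != expected:
        raise IOError(f"checksum mismatch for {path}")
    return path.read_bytes()
```

Note that a mismatch only tells you the data is bad; getting it back still requires a good replica or backup, which is exactly the gap between error detection and error correction.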
Then comes this observation:
Upon deep investigation at some customer sites, we found the software was fine, but each customer had one, and sometimes several, latent data corruptions on disk. Perhaps it was introduced by hardware, perhaps firmware, or possibly software. It could have even been corruption introduced by one of our previous releases when those pages were last written. Some of these pages may not have been written for years.
This just further confirms my experience, and it is the reason my recommendation to most clients is ZFS-based storage, and why all of my personal data is stored on ZFS-based systems: to ensure that at least the storage layer is making a serious attempt at end-to-end data integrity. And I use ECC memory in those servers.
I’m still not protected from application and network layer errors, but I can at least have reasonable confidence that the data I asked to be stored is being stored correctly.
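In the same spirit, and purely as an application-level analogy to what a ZFS scrub provides rather than a description of it, a periodic job can re-verify data at rest against a manifest of known-good digests, so latent corruption is noticed on a schedule rather than years later when a page is finally read. Everything here (the scrub function, the /data path, the manifest.json layout) is a hypothetical sketch:

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Same chunked SHA-256 helper as in the earlier sketch."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def scrub(root: Path, manifest: Path) -> list[Path]:
    """Re-check every file against the digest recorded when it was known good.

    The manifest is a JSON map of relative path -> hex digest. Anything that
    no longer matches is latent corruption nothing has read or noticed since.
    """
    expected = json.loads(manifest.read_text())
    return [
        root / rel
        for rel, digest in expected.items()
        if not (root / rel).exists() or sha256_of(root / rel) != digest
    ]


if __name__ == "__main__":
    # Hypothetical layout: files live under /data, digests in /data/manifest.json.
    for damaged in scrub(Path("/data"), Path("/data/manifest.json")):
        print(f"checksum mismatch or missing file: {damaged}")
```

Run something like this periodically and you get a very crude approximation of the property ZFS gives you at the storage layer: corruption is found proactively, not whenever a page that hasn't been written for years happens to be read again.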
And then we move on to the software and design realm.
Systems sufficiently complex to require deep vertical technical specialization risk complexity blindness. Each vertical team knows their component well but nobody understands the interactions of all the components. The two solutions are 1) well-defined and well-documented interfaces between components, be they hardware or software, and 2) very experienced, highly-skilled engineer(s) on the team focusing on understanding inter-component interaction and overall system operation, especially in fault modes. Assigning this responsibility to a senior manager often isn’t sufficiently effective.
Which is why I have relatively little confidence in many SOA architectures built on loosely coupled interfaces implemented by the lowest bidder against either incomplete specifications or specifications detailed to the point of obfuscation. Amusingly, the biggest push for these architectures comes from the IT departments, who seem to imagine that this will make their lives easier, when it’s really just more code to maintain, often outside the scope of the core application and the known portion of the problem space. Which, with the additional dependencies, adds more complexity to the change management process.