"A wise man learns by the mistakes of others, a fool by his own." - Latin Proverb
This book teaches you that “Feature complete” is not the same as “production ready”.
The summary at the end of the book sums up exactly what I feel about software development:
Change is the defining characteristic of software. That change – that adaptation – begins with release. Release is the beginning of the software's true life; everything before the release is gestation.
This book shows you how to really think about the design decisions that you make in any project.
Release 1.0 is the beginning of your software's life. Your quality of life after 1.0 depends on choices you make long before that vital milestone.
The reality of most software is that we mean well when we build it, but we often get this wrong.
New software emerges like a new university student, full of optimistic vigour, suddenly facing the harsh realities of the world outside the development environment. Things happen in the real world that just don't happen during development, usually bad things.
The book is broken down into four sections:
- Stability
- Capacity
- General Design Issues
- Operations
Stability
The book defines stability as follows:
A stable system is one that keeps processing transactions, even when there are transient impulses, persistent stresses, or component failures disrupting normal processing.
An impulse is a rapid shock to the system and stress is a force applied to the system over an extended period.
Sudden impulses and excessive strain can trigger catastrophic failure. These failures can expose cracks in the system. These cracks are called failure modes. When a catastrophic failure occurs there is always a chain of failures that caused it. One has to realise that the events are not independent, as there is always a layer of coupling.
Some interesting Stability Anti-patterns:
- Blocked threads – Don't hold onto resources.
- Attacks of Self-Denial – kicking yourself while you are down.
- Unbalanced capacities – Make sure that all systems can take each other's load.
- Slow Responses – These usually result from excessive demand.
To combat these here are some interesting Stability patterns:
- Use Time-outs – Networks are, and always will be, unreliable. Well-placed time-outs provide fault isolation.
- Circuit Breaker – Protect your system from all manner of integration point problems. If the integration point is down, stop calling it!
- Fail Fast – If your system can't meet its SLA, inform callers quickly. Check resource availability at the start of the transaction.
- Test Harness – Make sure that you simulate real-world failure modes. A great recent tool for this is Chaos Monkey.
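To make the time-out idea concrete, here is a minimal Python sketch. It is not code from the book: `slow_remote_call` and `call_with_timeout` are hypothetical names, and the remote call is simulated with a sleep. The point is only that the caller stops waiting after a fixed budget instead of blocking forever.

```python
import concurrent.futures
import time


def slow_remote_call(delay_seconds):
    """Hypothetical stand-in for a call to a remote integration point."""
    time.sleep(delay_seconds)
    return "response"


def call_with_timeout(func, timeout_seconds, *args):
    """Run func in a worker thread and stop waiting after timeout_seconds.

    Returns (True, result) on success and (False, None) on a time-out.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args)
    try:
        return True, future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        return False, None
    finally:
        # Do not make the caller wait for a stuck worker to finish.
        executor.shutdown(wait=False)


if __name__ == "__main__":
    print(call_with_timeout(slow_remote_call, 1.0, 0.01))  # finishes in time
    print(call_with_timeout(slow_remote_call, 0.05, 0.5))  # caller gives up
```

Note that the worker thread is not killed here; the caller simply stops waiting for it. In real code it is usually better to pass a time-out to the network library itself, for example the `timeout` argument of `socket.create_connection` or `urllib.request.urlopen`.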
By understanding the stability anti-patterns and the stability patterns described in the book, one can prevent these cracks from propagating through the layers of the system.
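The circuit breaker and fail fast patterns fit together nicely, so here is a rough Python sketch of how I picture them. This is my own simplified reading, not the book's implementation: the class name, the thresholds and the injectable `clock` parameter are all assumptions, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is refused immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_seconds:
                # Fail fast: do not touch the broken integration point.
                raise CircuitOpenError("integration point is marked as down")
            # Cool-down has elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap every call to the integration point in `breaker.call(...)`. Once the threshold is reached they get a `CircuitOpenError` straight away, which gives the struggling service room to recover.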
Capacity
The book defines capacity as follows:
Capacity is the maximum throughput a system can sustain, for a given workload, while maintaining an acceptable response time for each individual transaction.
Throughput describes the number of transactions the system can process in a given time span.
The hardest thing about dealing with capacity is working with nonlinear effects. In every system, exactly one constraint determines the system's capacity. It is important to understand these constraints.
Along with capacity come some myths. I had never really thought about these:
- CPU is cheap – In reality, 250 milliseconds per transaction at 1 million transactions a day adds up to 69.4 hours of CPU time every day.
- Storage is cheap – Storage is a service not a device.
- Bandwidth is cheap – Dynamically generated pages tend to have a lot of junk in them. 1K of junk per page equates to 1GB of junk with 1 million page views a day.
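The numbers in the CPU and bandwidth myths are easy to check. The short Python calculation below assumes 1 million transactions (or page views) a day, which is the volume at which the 69.4 hour figure works out, and uses decimal units for the bandwidth figure.

```python
SECONDS_PER_HOUR = 3600

# Assumption: 1 million transactions (or page views) per day.
transactions_per_day = 1_000_000

# "CPU is cheap": 250 ms of CPU time per transaction.
cpu_seconds_per_day = 0.250 * transactions_per_day
cpu_hours_per_day = cpu_seconds_per_day / SECONDS_PER_HOUR

# "Bandwidth is cheap": 1 KB of junk per page, in decimal units.
junk_bytes_per_day = 1_000 * transactions_per_day
junk_gb_per_day = junk_bytes_per_day / 1_000_000_000

print(f"CPU time per day: {cpu_hours_per_day:.1f} hours")    # 69.4 hours
print(f"Junk bandwidth per day: {junk_gb_per_day:.1f} GB")   # 1.0 GB
```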
Some interesting Capacity Anti-patterns:
- AJAX Overkill – Don't return a small piece of HTML and then send 400 requests back to your server to get the rest. Best to return the JSON to the browser on the first request.
- The Reload Button – Fast sites don't provoke the user into hitting the Reload button.
- Cookie Monsters – Don't store your database in a cookie.
To combat these here are some interesting Capacity patterns:
- Use Caching Carefully – Limit your cache size and cache expensive objects.
- Pre-compute Content – If generating the content is expensive, process it offline.
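As a small illustration of "Use Caching Carefully", Python's standard `functools.lru_cache` gives you a cache with a hard upper bound on its size. The `render_product_page` function and the size of 128 are made up for the example; the idea is simply that the cache cannot grow without limit and is only wrapped around something expensive.

```python
from functools import lru_cache

call_count = 0  # counts how often the expensive function body really runs


@lru_cache(maxsize=128)  # bounded: least recently used entries are evicted
def render_product_page(product_id):
    """Hypothetical expensive operation that is worth caching."""
    global call_count
    call_count += 1
    return f"<html>product {product_id}</html>"


render_product_page(1)
render_product_page(1)  # served from the cache; the body does not run again

print(call_count)
print(render_product_page.cache_info())
```

The `cache_info()` output also ties in with monitoring: the hit and miss counts tell you whether the cache is earning its memory.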
I recently tried to apply the wasted space remover pattern in one of our projects. I was able to remove just under 10k worth of white-space (though I did manage to introduce bugs, which frustrated the team). However, after discussing it with the team, it seems compression takes care of most of it. I still think it is important for clients that don't support compression (I know there are fewer and fewer all the time). This article seems to have some great statistics around why you shouldn't bother.
By understanding the capacity anti-patterns and the capacity patterns described in the book, one can understand how to fine-tune the system. This is achieved by an ongoing process of monitoring.
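A quick way to see why compression makes white-space removal less worthwhile is to measure both. The Python sketch below uses a made-up, very repetitive page, so the exact numbers mean little and real pages will compress less dramatically. The regular expression is a deliberately naive white-space remover of the kind that can introduce bugs (for example inside `<pre>` blocks).

```python
import gzip
import re

# Hypothetical page: heavily indented markup repeated many times.
row = "        <tr>\n            <td>  item  </td>\n        </tr>\n"
page = ("<table>\n" + row * 500 + "</table>\n").encode("utf-8")

# Naive white-space removal: drop runs of white-space between tags.
stripped = re.sub(rb">\s+<", b"><", page)

raw_saving = len(page) - len(stripped)
gzip_saving = len(gzip.compress(page)) - len(gzip.compress(stripped))

print(f"uncompressed: {len(page)} -> {len(stripped)} bytes (saves {raw_saving})")
print(f"saving once both versions are gzipped: {gzip_saving} bytes")
```

On this sample the saving after gzip is a small fraction of the uncompressed saving, which matches what the team told me.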
Capacity is fundamentally a measure of how much revenue the system can generate during a given period of time.
General Design Issues
There are many great topics in this section, but the one that I found really interesting is availability.
Availability of a system is typically measured as a factor of its reliability - as reliability increases, so does availability. However, no system can guarantee 100.000% reliability; and as such, no system can assure 100.000% availability – Wikipedia
An interesting take on availability is discussed in one of the stability anti-patterns, called SLA inversion.
SLA inversion states that unless every one of the systems you depend on is engineered for the same SLA you must provide, the best you can possibly do is the SLA of the worst of them.
It gets even worse than that statement. If built naively, the probability of the system failing is the joint probability of a failure in any component or service. This means that if your system depends on five external services that each have 99.9% availability, then the best your system can do is about 99.5% (I'm a little unclear on the maths; it would be great if someone pointed me in the right direction).
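One way to arrive at the 99.5% figure, assuming the five services fail independently and the system needs all of them to be up at the same time, is to multiply the individual availabilities together. A small Python check:

```python
# Assumptions: the services fail independently of each other, and the
# system is only available when every one of them is available.
service_availability = 0.999
service_count = 5

system_availability = service_availability ** service_count

HOURS_PER_YEAR = 365 * 24
downtime_one_service = (1 - service_availability) * HOURS_PER_YEAR
downtime_system = (1 - system_availability) * HOURS_PER_YEAR

print(f"system availability: {system_availability:.4%}")  # 99.5010%
print(f"expected downtime per year, one service: {downtime_one_service:.1f} hours")
print(f"expected downtime per year, whole system: {downtime_system:.1f} hours")
```

So each 99.9% dependency chips away a little more, and five of them together cost roughly five times the downtime of one.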
Operations
This section begins with a great story about how, when it rains, it pours, and how they dealt with it. Really interesting. The topic that really meant something to me in this section was Transparency.
Experienced engineers on ships can tell you when something is about to go wrong by the sound of the giant diesel engines. Transparency refers to the qualities that allow operators, developers and business sponsors to gain understanding of the system's historical trends, present conditions, instantaneous state and future projections.
Designing for transparency is really important, as adding transparency late in development is about as effective as adding quality late.
Some great ideas around transparency are as follows:
- Monitoring and reporting systems should be built around your system, not in it. Better to expose than to couple to a service.
- Make sure you discuss what triggers alerts.
- Logging is still very important to this day, and what you log is even more important. A pet hate of mine is a system that tells you everything is OK in its log files. Log files should only be used as a way to see what is going wrong with the system (what are your thoughts around this?).
- It is also important to get the logging levels right. I personally think that in production nothing more verbose than WARNING should be logged.
- Understanding all the messages your system will produce is important. This is easier if you built the whole application. Message codes simplify the communication between development and operations.
- Remember that in the end all of your decisions need to be understood by humans. When stressful situations occur the last thing you want is to try to decipher what the system is trying to tell you.
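To show what I mean about logging levels and message codes, here is a minimal sketch using Python's standard `logging` module. The logger name, the `APP-1001` code and the message text are all invented for the example; the idea is that production only records WARNING and above, and that each message carries a code operations can look up.

```python
import logging

# Hypothetical message code agreed between development and operations.
DB_POOL_EXHAUSTED = "APP-1001"

logger = logging.getLogger("orders")  # "orders" is an illustrative name
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)

# In production, record only WARNING and above; INFO and DEBUG are dropped.
logger.setLevel(logging.WARNING)

logger.info("everything is OK")  # suppressed, never reaches the log
logger.warning("%s connection pool exhausted, pool=%s", DB_POOL_EXHAUSTED, "orders-db")
```

The level is set in one place, so it could just as easily come from configuration, with a more verbose level used in development.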
This has been a fantastic read and I really recommend it to anyone who wants to think hard about what happens to a system once it goes out into the wild. Many of these issues are now being addressed by cloud providers. The book shows some of the early thinking behind the problems that the DevOps movement is trying to solve, so I applaud it. If you want more information check out this blog.