"A wise man learns by the mistakes of others, a fool by his own." - Latin Proverb
This book teaches you that “Feature complete” is not the same as
“production ready”.
The summary at the end of the book sums up exactly how I feel about
software development:
Change
is the defining characteristic of software. That change – that
adaptation – begins with release. Release is the beginning of the
software's true life; everything before the release is gestation.
This book shows you how to really think about the design
decisions that you make in any project.
Release
1.0 is the beginning of your software's life. Your quality of life
after 1.0 depends on choices you make long before that vital
milestone.
The reality of most software is that we mean well when
we build it, but we often get it wrong.
New
software emerges like a new university student, full of optimistic
vigour, suddenly facing the harsh realities of the world outside the
development environment. Things happen in the real world that just
don't happen during development, usually bad things.
The book is broken down into four sections:
Stability
Capacity
General Design Issues
Operations
Stability
The book defines stability as follows:
A
stable system is one that keeps processing transactions, even when
there are transient impulses, persistent stresses, or component
failures disrupting normal processing.
An
impulse is a rapid shock to the system and stress is a force applied
to the system over an extended period.
Sudden
impulses and excessive strain can trigger catastrophic failure. These
failures can expose cracks in the system. These cracks are called
failure modes. When a catastrophic failure occurs there is always a
chain of failures that caused it. One has to realise that the events
are not independent, as there is always a layer of coupling.
Some interesting stability anti-patterns:
Blocked threads – Don't hold onto resources.
Attacks of Self-Denial – Kicking yourself while you are down.
Unbalanced capacities – Make sure that every system can handle the load the others send it.
Slow Responses – These usually result from excessive demand.
To combat these, here are some interesting stability patterns:
Use Time-outs – Networks are, and always will be, unreliable. Well-placed time-outs provide fault isolation.
Circuit Breaker – Protect your system from all manner of integration point problems. If the integration point is down, stop calling it!
Fail Fast – If your system can't meet its SLA, inform callers quickly. Check resource availability at the start of the transaction.
Test Harness – Make sure that you simulate real-world failure modes. A great recent tool for this is Chaos Monkey.
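To make the time-out idea concrete, here is a minimal sketch in Python. It is my own illustration, not code from the book, and the helper name fetch_with_timeout is made up for this post. It assumes a plain TCP service and simply gives up when the other side is too slow.

```python
import socket


def fetch_with_timeout(host, port, timeout_seconds):
    """Return the first bytes sent by a TCP service, or None if it is too slow.

    Hypothetical helper for illustration. The timeout covers both the
    connection attempt and the read, so a hung service cannot block the
    caller forever.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds) as conn:
            return conn.recv(1024)
    except socket.timeout:
        # The caller gets a quick, explicit "no answer" instead of a blocked thread.
        return None
```

Other errors, such as a refused connection, are deliberately left to propagate, because they already fail quickly. The point of the sketch is only that a slow integration point should cost the caller a bounded amount of time.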
By understanding the stability anti-patterns and the stability patterns
described in the book, one can prevent these cracks from propagating
through the layers of the system.
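The circuit breaker is the pattern that stuck with me most, so here is a rough sketch of how I picture it. This is my own simplified version, not the book's implementation: the class name, the failure threshold and the cool-down period are all assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the integration point."""


class CircuitBreaker:
    """Simplified breaker: opens after repeated failures, retries after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold      # assumed value
        self.reset_after_seconds = reset_after_seconds  # assumed value
        self.clock = clock                              # injectable so it can be tested
        self.failures = 0
        self.opened_at = None                           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_seconds:
                # Still cooling down: stop calling the broken integration point.
                raise CircuitOpenError("integration point is marked as down")
            # Cool-down has elapsed: let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

You would wrap every call to a remote service in breaker.call(...). Once the service has failed often enough, callers get an immediate CircuitOpenError instead of waiting on a dead dependency, which is also a form of failing fast.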
Capacity
The
book defines capacity as follows:
Capacity
is the maximum throughput a system can sustain, for a given workload,
while maintaining an acceptable response time for each individual
transaction.
Throughput
describes the number of transactions the system can process in a
given time span.
The hardest thing about dealing with capacity is working
with non-linear effects. In every system, exactly one constraint
determines the system's capacity. It is important to understand these
constraints.
Along with capacity come some myths. I had never really
thought about these:
CPU is cheap – In reality, 250 milliseconds per transaction at 1 million transactions a day adds up to 69.4 hours of CPU time every day.
Storage is cheap – Storage is a service, not a device.
Bandwidth is cheap – Dynamically generated pages tend to have a lot of junk in them. 1K of junk per page equates to 1GB of junk with 1 million page views a day.
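The numbers behind these myths are easy to check for yourself. The snippet below is just my own back-of-the-envelope arithmetic. It assumes 1 million transactions a day for the CPU figure (that is the volume at which 250 milliseconds gives 69.4 hours) and treats 1K as 1,000 bytes.

```python
# Assumed daily volumes; swap in your own numbers.
TRANSACTIONS_PER_DAY = 1_000_000
CPU_SECONDS_PER_TRANSACTION = 0.250   # 250 milliseconds

PAGE_VIEWS_PER_DAY = 1_000_000
JUNK_BYTES_PER_PAGE = 1_000           # 1K of junk, counted as 1,000 bytes

cpu_hours_per_day = TRANSACTIONS_PER_DAY * CPU_SECONDS_PER_TRANSACTION / 3600
junk_gb_per_day = PAGE_VIEWS_PER_DAY * JUNK_BYTES_PER_PAGE / 1e9

print(f"{cpu_hours_per_day:.1f} CPU hours per day")   # 69.4 CPU hours per day
print(f"{junk_gb_per_day:.1f} GB of junk per day")    # 1.0 GB of junk per day
```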
Some interesting capacity anti-patterns:
AJAX Overkill – Don't return a small HTML page and then send 400 requests back to your server to get the rest. Best to return the JSON to the browser on the first request.
The Reload Button – Fast sites don't provoke the user into hitting the Reload button.
Cookie Monsters – Don't store your database in a cookie.
To combat these, here are some interesting capacity patterns:
Use Caching Carefully – Limit your cache size and cache expensive objects.
Pre-compute Content – If generating the content is expensive, process it offline.
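As a small illustration of limiting your cache size, here is a sketch using functools.lru_cache from the Python standard library. The function render_product_page and the tiny maxsize are made up for the example. The point is only that the cache has an upper bound and evicts the least recently used entry when it is full.

```python
from functools import lru_cache


@lru_cache(maxsize=2)  # deliberately tiny so the eviction is visible
def render_product_page(product_id):
    """Hypothetical expensive page render; stands in for any costly object."""
    return f"<html>product {product_id}</html>"


# Request pages 1, 2, 1, 3, 2. Only the second request for page 1 is a hit;
# page 2 has been evicted by the time it is requested again.
for product_id in (1, 2, 1, 3, 2):
    render_product_page(product_id)

info = render_product_page.cache_info()
print(info.hits, info.misses, info.currsize)  # 1 4 2
```

An unbounded cache is really just a memory leak with good intentions, so picking a maxsize (and measuring the hit rate with cache_info) seems like the careful way to do it.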
I recently tried to apply the wasted space remover pattern in one of
our projects. I was able to remove just under 10k worth of
white-space (though I did manage to introduce bugs, which frustrated
the team). However, after discussing it with the team, it seems that
compression takes care of most of it. I still think it is important
for clients that don't support compression (I know there are fewer and
fewer all the time). This article
seems to have some great statistics on why you shouldn't bother.
By understanding the capacity anti-patterns and the
capacity patterns described in the book, one can understand how to
fine-tune the system. This is achieved through an ongoing process of
monitoring.
Capacity
is fundamentally a measure of how much revenue the system can
generate during a given period of time.
General Design Issues
There are many great topics in this section, but the
one that I found really interesting is availability.
Availability
of a system is typically measured as a factor of its reliability - as
reliability increases, so does availability. However, no system can
guarantee 100.000% reliability; and as such, no system can assure
100.000% availability – Wikipedia
An interesting take on availability is discussed in one
of the stability anti-patterns, called SLA inversion.
SLA
inversion states that unless every one of your dependent systems is
engineered for the same SLA you must provide, then the best you can
possibly do is the SLA of the worst dependent system.
It gets even worse than that. If built naively,
the probability of a system failing is the joint probability of a
failure in any component or service. This means that if your system
depends on five external services that each have 99.9% availability,
then the best your system can do is about 99.5%, since 0.999 multiplied
by itself five times is roughly 0.995 (assuming the services fail
independently of each other).
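For anyone who wants to check the maths, here is a small sketch. It assumes that the system only works when every dependency works and that the dependencies fail independently of each other, in which case the availabilities simply multiply. The function name is my own.

```python
import math


def combined_availability(availabilities):
    """Availability of a system that needs ALL of its dependencies to be up.

    Assumes the dependencies fail independently, so the probabilities multiply.
    """
    return math.prod(availabilities)


five_services = [0.999] * 5
print(f"{combined_availability(five_services):.4%}")  # 99.5010%
```

Every extra dependency multiplies in another number below 1, so the combined figure can only go down. That is why the naive design ends up worse than its worst dependency.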
Operations
This chapter begins with a great story about how, when it
rains, it pours, and how they dealt with it. Really interesting. The
topic that really meant something to me in this section was
transparency.
Experienced
engineers on ships can tell you when something is about to go wrong
by the sound of the giant diesel engines. Transparency refers to the
qualities that allow operators, developers and business sponsors to
gain understanding of the system's historical trends, present
conditions, instantaneous state and future projections.
Designing for transparency is really important,
as adding transparency late in the development is about as effective
as adding quality.
Some great ideas around transparency are as follows:
Monitoring and reporting systems should be built around
your system, not in it. Better to expose than to couple to a
service.
Make sure you discuss what triggers alerts.
Logging is still very important to this day, and what
you log matters even more. A pet hate of mine is a system that tells
you everything is OK in its log files. Log files should only be used as a
way to see what is going wrong with the system (what are your
thoughts on this?).
It is also important to get the logging levels right. I
personally think that in production nothing more verbose than WARNING
should be logged.
Understanding all the messages your system will produce
is important. This is easier if you built the whole application.
Message codes simplify the communication between development and
operations.
Remember that in the end all of your decisions need to
be understood by humans. When stressful situations occur the last
thing you want is to try to decipher what the system is trying to
tell you.
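Here is a rough sketch of how the logging ideas above could look with Python's standard logging module. The message codes, the logger name "shop" and the helper functions are all invented for this post. The only real point is that the level is set to WARNING, so the "everything is OK" chatter never reaches the log file, and that each message carries a code operations can look up.

```python
import io
import logging

# Hypothetical message catalogue shared by development and operations.
MESSAGES = {
    "ORD-000": "order processed",
    "PAY-001": "payment gateway timed out",
}


def build_logger(stream):
    """Create a logger that only lets WARNING and above through."""
    logger = logging.getLogger("shop")   # assumed logger name
    logger.handlers.clear()              # avoid duplicate handlers on re-use
    logger.setLevel(logging.WARNING)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def log_coded(logger, level, code, **details):
    """Log a catalogued message with its code and any extra details."""
    logger.log(level, "%s %s %s", code, MESSAGES[code], details)


buffer = io.StringIO()
logger = build_logger(buffer)
log_coded(logger, logging.INFO, "ORD-000", order_id=41)    # dropped: below WARNING
log_coded(logger, logging.ERROR, "PAY-001", order_id=42)   # kept
print(buffer.getvalue(), end="")
```

In a stressful situation, a short code like PAY-001 is much easier to search for and talk about than a free-form sentence.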
Conclusion
This
has been a fantastic read and I really recommend it to anyone who
wants to really think about what happens to their system once it goes
out into the wild. A lot of these issues are now being addressed by
cloud providers. This book really shows some of the early work on the
problems that the DevOps movement is trying to solve, so I applaud it.
If you want more information, check out this blog.