I've worked in infrastructure for most of the 15 years of my career, cutting my teeth in the UK ISP market back when what we called "virtual servers" were badly secured chroot environments. These days, I believe we're referring to these as "Docker Containers", and that's not all that's changed. I now spend my days running a small infrastructure team in London, supporting businesses of all shapes and sizes.
New tools, platforms and practices are springing up all the time, and it's often difficult to know which of these will change our working lives for the better, and which will become a time-sinking problem child. Our economies of scale rely on being able to reuse solutions across a number of customers, so it's important that we think carefully before we bet the server farm on the latest shiny silver bullet.
So, if you'll forgive the blatant Buzzfeediness of the title to this post, gather round the Sysadvent Tanenbaum (Andrew S., of course) and I'll share the 14 characteristics that we consider when we're evaluating a new piece of software, platform or service. These are the goals we set ourselves for infrastructure, and the thing under consideration must help us further at least some of these goals.
First of all, we're concerned with ease of management. This often comes down to user experience: in the case of software, how easy is it to install and configure? If we're talking about a platform or service, is the UI friendly? Is there an API? What are its limitations?
Ultimately, the best solutions minimise the amount of cognitive load required to work with them - and not because operations teams aren't smart people. I think Redis is a good example of manageable software: it's easy to install, the configuration file is straightforward, and the command interface makes sense.
Anything that requires too much thinking on a day-to-day basis is likely to be difficult to manage, and as many of us will have learned the hard way by now, complicated systems are more likely to fail.
Speaking of failure, a good infrastructure needs to consider availability. Some tools (such as ElasticSearch or RabbitMQ) are capable of replicating state across a cluster and thus their availability strategy is to run more than one node, and balance requests between them. Other services don't replicate state data, or necessitate a single master, and so it's necessary to follow a failover scenario instead.
It's important to realise that we're not necessarily talking about 100% uptime: a good infrastructure is available enough for the business it is supporting. In some cases, nine 5s of availability may well be more appropriate than five 9s, and often, trading availability off against cost is the right thing to do.
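To make that trade-off concrete, here's a quick back-of-the-envelope sketch (plain Python, not tied to any particular SLA definition) of the downtime budget each availability target allows per year:

```python
# Rough downtime budget per year for a given availability target.
# Illustrative only: real SLAs also define measurement windows and exclusions.

MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years

def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of permitted downtime per year at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.1f} min/year")
```

Five 9s buys you barely five minutes of downtime a year; if the business can genuinely live with an hour or two, you can often buy a great deal of simplicity (and sleep) with the difference.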
3. Cost Effectiveness
Cost, then, is a relative concern. Not every business is rolling in Aeron chairs, foosball tables, and filthy venture lucre. Some of us make an honest bootstrapped living and, as anyone who's ever tried to get my business partner to buy a new laptop will be reminded, it's important to keep an eye on the bottom line, only spending money where necessary. Good infrastructure should be cost effective: pricing of components and services should scale appropriately with business value.
4. Supplier Agnosticism
One way of keeping costs in check is to ensure you choose tooling that doesn't lock you into the services of a particular commercial vendor. Mentioning no Oracle names, some companies appear to lure you in with the promise of inexpensive consultancy hours, get you hooked on their product, and then slowly ramp up the license fees every year.
When we build infrastructure, we try and do so in a way which remains portable between suppliers, preferring open source tooling, and where possible abstracting anything which makes use of proprietary APIs. For example, when we use the AWS APIs from our Puppet manifests (in order to get a list of hosts to use in a config file), we do so through a custom Hiera back-end. If we deploy the same manifests to another cloud supplier, we can just switch out that Hiera back-end without changing our manifests.
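The Hiera back-end trick is Puppet-specific, but the pattern underneath it translates to any stack: put the supplier-specific API behind a small interface, and let everything else depend only on that interface. A minimal sketch in Python (the class names, host names and port are all hypothetical, and the AWS call is stubbed out to keep the example self-contained):

```python
# Hide the supplier-specific API behind a small interface so that
# callers (templates, manifests, config generators) never touch it.
from abc import ABC, abstractmethod

class HostInventory(ABC):
    """Anything that can answer 'which hosts play this role?'."""
    @abstractmethod
    def hosts_for_role(self, role: str) -> list[str]: ...

class AwsInventory(HostInventory):
    def hosts_for_role(self, role: str) -> list[str]:
        # In reality this would query the EC2 API (e.g. via boto3) and
        # filter instances by tag; stubbed here so the sketch runs anywhere.
        return [f"aws-{role}-1.example.com", f"aws-{role}-2.example.com"]

class StaticInventory(HostInventory):
    """A drop-in replacement for another supplier, or for tests."""
    def __init__(self, mapping: dict[str, list[str]]):
        self._mapping = mapping

    def hosts_for_role(self, role: str) -> list[str]:
        return self._mapping.get(role, [])

def render_backend_list(inventory: HostInventory) -> str:
    # The "config template" only ever sees the interface, so swapping
    # suppliers means swapping one constructor, not rewriting templates.
    return "\n".join(f"server {h}:8080" for h in inventory.hosts_for_role("web"))

print(render_backend_list(AwsInventory()))
```

Moving to another supplier then means writing one new `HostInventory` implementation, exactly as swapping the Hiera back-end leaves the Puppet manifests untouched.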
On a related note, sticking to standard solutions is a good way to keep your infrastructure effective. In the majority of your infrastructure, you won't be solving any unique problems, so use the de facto standard tools since these will likely cover your use case. If you think your URL routing problem is so unique that you have to write a brand new bespoke traffic director in Go, rather than using HAProxy, Varnish or Nginx, chances are you haven't thought about the problem sufficiently. If you're going to build rather than reuse, then the thing you build should be a significant differentiator in your market, or you could probably spend that time better somewhere else.
There's a balance to be struck, however: new solutions are often worth looking at, particularly if they seem likely to become the new standard one day. One great example of this is Vagrant, which we adopted very early on, and which we've since derived a huge amount of value from both internally and in the teams we work with. Beware the temptation of Not Invented Recently syndrome, though: the tendency to assume that something new and shiny is always going to be better than the trusty, crusty predecessor. Evaluate carefully before making the leap. We currently have Terraform and the Docker ecosystem under evaluation for our standard solution: they both look like they will deliver value to us, but are new enough that we're being cautious.
Another important concern is scalability. Any infrastructure you build should be capable of changing in size to meet the changing demands of your business. And remember, we're not just considering the scaling of systems: the scalability of teams and processes needs to be considered too. We're also not just considering scaling up: being able to scale down and simplify things can be just as important. One piece of software which seems to get scalability right is ElasticSearch: as demands on a search cluster grow, adding nodes and sharding indexes is trivial, and it all happens without needing to take the service offline. Compare this with scaling up a MySQL cluster, and all the manual work that involves.
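The mechanics behind that are worth a moment: because the shard count is fixed when an index is created, growing the cluster only means relocating whole shards, so just a fraction of the data has to move. A toy model in Python (nothing like the real ElasticSearch allocator, which also weighs disk usage, allocation awareness and so on):

```python
# Toy model of Elasticsearch-style scaling: shards are fixed up-front,
# and adding a node relocates just enough shards to even out the load.
from collections import Counter

def rebalance(assignment: dict[int, str], new_node: str) -> dict[int, str]:
    """Move the minimum number of shards onto new_node to balance the cluster."""
    nodes = sorted(set(assignment.values()) | {new_node})
    target = len(assignment) // len(nodes)
    result = dict(assignment)
    while Counter(result.values())[new_node] < target:
        load = Counter(result.values())
        # Take a shard from whichever existing node is busiest.
        busiest = max((n for n in nodes if n != new_node), key=lambda n: load[n])
        shard = next(s for s, n in result.items() if n == busiest)
        result[shard] = new_node
    return result

before = {s: f"node-{'abc'[s % 3]}" for s in range(12)}  # 12 shards, 3 nodes
after = rebalance(before, "node-d")
moved = sum(1 for s in before if before[s] != after[s])
print(f"{moved} of 12 shards relocate when a fourth node joins")
```

Only a quarter of the shards move; the rest of the cluster carries on serving traffic throughout.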
Related to scalability is performance. A good system should be performant enough for the needs of the business. These needs will often vary by market sector: infrastructure for an e-commerce vendor selling knitting patterns will have different performance requirements than one selling tickets for a popular musician (though oddly both of these entirely non-hypothetical customers insist on using bloody MongoDB for some reason). When we're evaluating new tooling for a customer infrastructure, we'll often use load tests to prove that we can get the transactional throughput they'll need before everything starts to melt and fall apart.
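In practice we reach for an established load-testing tool (ab, wrk, JMeter and friends), but the shape of such a test is simple enough to sketch. A toy harness in Python, where the handler is a stand-in for a real HTTP round trip and the numbers are purely illustrative:

```python
# Minimal load-test sketch: hammer a request function from several
# workers, then report throughput and 95th-percentile latency.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def handler() -> None:
    time.sleep(0.002)  # stand-in for a real HTTP request

def load_test(requests: int, workers: int) -> dict[str, float]:
    latencies: list[float] = []

    def one_request(_: int) -> None:
        start = time.perf_counter()
        handler()
        latencies.append(time.perf_counter() - start)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(one_request, range(requests)))
    elapsed = time.perf_counter() - start
    return {
        "throughput_rps": requests / elapsed,
        # statistics.quantiles with n=20 gives 19 cut points; the last is p95.
        "p95_ms": 1000 * statistics.quantiles(latencies, n=20)[-1],
    }

print(load_test(requests=200, workers=8))
```

The useful part isn't the harness, it's agreeing up-front what throughput and percentile latency the business actually needs, then proving the stack can deliver it before go-live rather than after.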
Things that don't fall apart under load are considered to be stable, but that's not the only definition we're concerned with. When we're evaluating new tools or platforms, we also care about the amount of breaking change that they undergo. If API changes are incompatible across minor version updates then we're likely to have a bad time keeping up to date with those releases, and anyone who's written software that relies on Ruby libraries knows precisely how that feels. For the benefit of the lucky reader who hasn't written software relying on Ruby libraries: it feels remarkably like being repeatedly slapped in the face.
Our standard platform uses slightly older, tried and tested tools instead of some of the shiny new stuff. A stable platform is easier (and therefore less expensive) to support.
Having said that, change is actually pretty great! And even if you don't agree, I don't care, because you'll find it's also an absolute inevitability. Businesses that are able to make changes can quickly adapt to market conditions and beat their competition to the punch (which is apparently a boxing analogy, and nothing to do with dangerous mixed party beverages). An effective infrastructure needs to be responsive to change, otherwise it will quickly cease being relevant.
There are some industries for whom change is a scary prospect: for example finance, or pharmaceuticals. Don't get me wrong, they want to be able to adapt to market conditions just as much as the next vertical, but they must also maintain regulatory compliance, and satisfy their auditors. If you've never encountered an auditor in the wild, these are often fastidious, pernickety creatures, with an unhealthy interest in the generation of paperwork. If you can choose solutions that make the paperwork generation easier, you'll have a much better time all round: no good DevOps engineer I've met wants to spend their working life writing Word documents.
If you enjoy strong opinions on the subject of regulatory compliance, you can watch me rant on the subject at https://puppetlabs.com/presentations/puppet-and-devops-regulated-environments
12. Operational Visibility
Audit trails are an example of operational visibility, a term I'm mostly using here to talk about logging and monitoring. A good infrastructure generates meaningful operational data and makes it easy to aggregate helpfully. When we're evaluating a new piece of server-side software, one of our considerations is how easy it will be to monitor for performance and availability. The answer tends to be either "fantastic" or "abjectly miserable", because as we should all know by now: monitoring sucks.
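What "easy to monitor" often boils down to is: can we measure one meaningful number and compare it against thresholds? Here's a sketch of the classic Nagios-style check contract in Python (the thresholds and metric are illustrative, but the 0/1/2 status convention is the standard one):

```python
# Sketch of the check contract most monitoring systems understand:
# measure something, compare to warn/crit thresholds, return a status.
OK, WARNING, CRITICAL = 0, 1, 2  # the Nagios plugin exit-code convention

def check_latency(measured_ms: float,
                  warn_ms: float = 200,
                  crit_ms: float = 500) -> tuple[int, str]:
    """Classify a latency measurement against warning/critical thresholds."""
    if measured_ms >= crit_ms:
        return CRITICAL, f"CRITICAL - latency {measured_ms:.0f}ms"
    if measured_ms >= warn_ms:
        return WARNING, f"WARNING - latency {measured_ms:.0f}ms"
    return OK, f"OK - latency {measured_ms:.0f}ms"

status, message = check_latency(123)
print(message)
```

Software that exposes its key numbers cleanly (a stats endpoint, a status command) makes writing checks like this trivial; software that doesn't lands you firmly in "abjectly miserable" territory.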
13. Security
An effective infrastructure has a security posture that's appropriate for the business need. At a basic level, administrative tooling should require authentication and authorisation, be inaccessible to the general public, and employ an appropriate level of cryptographic protection. Industries dealing with personally identifiable data, credit card numbers and the like will require stronger countermeasures. Unlike some of the other characteristics we're talking about here, it's comparatively difficult to retrofit good security practice, and it's good to understand the security requirements up-front so as to avoid future rework. Not being one to pass up another opportunity to hate on MongoDB, I'll illustrate this by sharing that one of our more regulated customers had to ditch the MongoDB part of their solution and replace it with another data store because at the time, Mongo had no user security model to speak of and this was deemed unacceptable for that system of record.
14. Repeatability
Repeatability is a big concern for us, because we build new infrastructures for new customers all the time. Using automation tools means we can stamp out the basics of an effective infrastructure in relatively short order. Getting this repeatability right means that building development and test infrastructures is also straightforward. In larger, more established organisations (and even in one or two younger, dumber ones) I've heard project managers cite "lack of environments" as their main impediment to project delivery, so it's clear that this stuff is important. If your effective infrastructure is repeatable, the Gantt chart jockeys will have to find some other excuses for not hitting their optimistic delivery schedules, and isn't it about time they did?
(Gift) Wrapping Up
People who haven't squandered their festive goodwill on debugging problems with JMX tell me that the holiday season is a time for sharing, and so I hope what I've shared above is of some use to you in 2016 and beyond. Looking back over this list, there's one main theme: that it's important to understand the wider business context in which your infrastructure exists, and to use that to guide your decision-making.
However you're spending the rest of this year, I wish you the company of the people you love, a silent pager, and a monitoring dashboard of the purest green.