December 20, 2014

Day 20 - The Pursuit of Learning through Bad Ideas

Written by: Michael Stahnke (@stahnma)
Edited by: Michelle Carroll (@miiiiiche)

I have a confession: I love terrible ideas. I really enjoy trying to think of the absolute worst way to solve problems, largely because being a contrarian is fun. Then I realized something — coming up with the exact wrong way to solve a problem is not only a good time, but can actually be helpful.

My love for sharing terrible ideas was codified when one of my teams (and several people from other areas inside engineering) decided to embrace this behavior and create “Bad Idea Monday.” After we participated in several debates fueled by the worst ideas available, some tangible benefits emerged.

Happy employees do better work. This has been proven countless times. What makes employees happy? Fun things, perks, benefits, and pay are up there, but in my experience, what really gets people engaged is learning. Encouraging and embracing new ways of learning is paramount to building the culture you want. Capturing the desire to talk about the worst ways to solve your problems provides a lot of fresh opportunities to learn.

The worst can make you better

As you throw out the absolute worst idea possible to solve something, several outcomes can occur.

  1. Your idea, while terrible, just isn’t bad enough. Somebody else in the discussion thinks they can do better (worse). They try to one-up you. They often succeed, and it’s amazing. This sport of spouting bad ideas leads to collaboration, as one person’s idea gets picked up and added to by others.

  2. A terrible idea isn’t understood by everybody to be terrible. This often happens when there’s a wide range of experience, either in the job, or within this specific problem domain. The discussion can help spread knowledge, as a more experienced team member explains why your solution of “install head mounted GoPro cameras for auditing purposes” might not actually make your audits any cleaner.

  3. Experienced people get a new viewpoint on problems. The problems you face today may be similar to ones you’ve seen before. Trying to think of the worst possible solution forces you to deviate from your usual viewpoint, and can lead to another level of understanding. It can also lead to you reaching for tools or solutions that you’d normally not have considered.

  4. You come up with a real, legitimate solution. It’s likely one you and your team would not have arrived at without getting creative and trying to think of the worst idea. For example, choosing a Google spreadsheet[1] as the back end for an internal service. It sounds like a terrible idea. A spreadsheet isn’t really a database. It doesn’t really have a great query language, it can’t handle lots of updates per second, but it has access control, it’s a familiar interface for non-technical folks, and doesn’t require significant upgrades or maintenance.

  5. The team learns to debate and discuss ideas. This is important. Because these ideas are intentionally terrible, people don’t get offended when somebody shoots down the idea (or builds on it to come up with something worse). It helps the team learn how to debate properly. Learning how to dismantle ideas without judgment is a much healthier and more productive practice than attacking the person with the idea.

How does it work?

Bad Idea Monday doesn’t have to be a Monday, but it works well when it is. Because, let’s be honest, Mondays are the day of the week that people normally dread. There are copious jokes, cartoons, and comics about how much we all hate the first day back at work after a nice weekend. Capitalize on Monday’s bad reputation, and use it to get your team to generate the worst possible ideas.

How do you get started? First, you need a problem. This problem could come from your ticketing system, a chat conversation, or a face-to-face discussion of something just not working the way it should. The input queue is more or less limitless. After you have a situation, don’t try to solve it — at least not the way you normally would. Turn it on its head. This doesn’t require a meeting. It can happen in any medium, and occur numerous times throughout the day.

Allow me to walk through an example.

Bad Idea Monday in practice

When Puppet Labs was moving our server-side stack from a Ruby-based solution to Clojure and JRuby, we uncovered a new set of problems. We knew we needed a JRE, but that was about all we knew. Did we need a specific JRE? Did we want to compile a JVM for the ~30 permutations of platforms supported as masters on Puppet Enterprise? Were we going to have to package it? Did we want to require that the end-user brings in libalsa because that’s what normal JVMs do?

So the fundamental problem: how do we ship/bundle a JVM to our enterprise customers? What’s the worst answer to this? We could just unzip a binary of the JVM and somehow work it into our filesystem path, but that solution was rejected because it wasn’t bad enough. We could use netcat and dd for distribution, but that wasn’t interesting enough. Then we got an idea. An awful idea. We got a wonderful, awful idea!

[Image: the Grinch gets a bad idea]

We ship the JVM as a gem. Rubygems allows you to compile things on the fly. Rubygems is cross platform. Rubygems is available over the network. Sure, this content wasn’t Ruby, but why should that stop us?

This is a terrible idea. Why? Well, you would need way too many dependencies. You have to have Ruby on the box already. You have to be connected to a network for a successful installation. You can’t express C-header dependencies in Rubygems. You have to have a compiler on the target system. You have to wait something like 35 minutes for the JDK to compile during a Rubygems installation. In most cases, you actually need a JVM in order to bootstrap and compile a JVM. You have to write a mkmf file to instruct the machine how to do that. At the time, signing gems was basically unheard of. You probably don’t want the JVM in your Ruby load path, but maybe you could move the files in a gem postinstall with enough finagling.

This conversation ended shortly after it started, with the team providing these counterexamples, in addition to others not covered here. We knew it was doomed. It was fun though.

We ended up shipping a version of OpenJDK that we built and optimized for our workload using the native package manager for the platforms. However, when we were dealing with some pretty hairy Ruby problems in subsequent releases, we were able to build on our knowledge of the limitations (and advantages) of the more esoteric features of Rubygems — stuff we’d looked into while identifying why it was the worst way to deliver a Java solution. When we needed to bundle some Ruby content with our distribution, that earlier discussion was extremely useful.

What did we learn from the conversation?

  • Knowledge of some of the newer (and esoteric) features of Rubygems. By the end, we’d figured out answers to questions like: What does the postinstall situation really look like? What’s the state of signing a package? What type of compiler manipulation can reasonably be done and expected on an end-user’s system?
  • Why library managers are bad general purpose package managers.[2] This may seem obvious, but it’s a good discussion for those who haven’t really thought about it.
  • Bootstrapping a JVM is a hard problem.

We also had a great time thinking of ways to bend Rubygems to our will.

The rest of the week

The team liked Bad Idea Monday so much, they created theme days for the rest of the week. I’ll walk through them quickly:

Positive Tuesday. This is a day to be positive. The original intent was to offset the perceived negativity perpetuated with bad ideas that happened on Monday, but it’s really not needed for those reasons. The thing I like about it is the ‘find something you like about it’ attitude, which sometimes can help. Everything is not always wonderful. When it’s not, at least on a Tuesday, we can try to improve our outlook by identifying the good parts (or potentially decent outcomes) of an otherwise less-than-awesome situation. This assists in scenarios where you may have lost a debate, but need to move forward. It can bolster a “disagree and commit” interaction paradigm.

Noncommittal Wednesday. Why make a decision today when you could put it off until tomorrow? I think this started as the neutral leg to balance the bad (Monday) and good (Tuesday). Since then, this day hasn’t done much. I mean, I could tell you more about it, but I just can’t seem to commit to it.

Troll Thursday. Trolling your coworkers can be fun. We keep it pretty clean and innocent, but some days, you just have to see if you can engage the team on something ridiculous, get them to believe some crazy story, or convince them that DECnet[3] really is the one true networking protocol. I enjoy Troll Thursday because it can be used for learning rather than simply for my own amusement. Also, I am not immune to being trolled. ABT.

FriDre. On Friday, two things happen. One, somebody will forget. Two, we will remind them. Heck, our chat bot will remind you. I’ll admit that Not Forgetting About Dre[4] is a little less fun now that he’s the first billionaire in hip hop. Nonetheless, remembering Dre is something that’s been a part of the culture at Puppet Labs for a long time — nearly as long as I’ve been on board. What purpose does it serve? Other than being fun, I have no idea. I’m even pretty sure I’m the one who decided we shouldn’t forget about Dre.


These theme days have made it easier for me to demonstrate three things: the team is creative, they have fun while they work, and they’re an awesome group. We have a wide variety of people, ranging from their mid-twenties to mid-forties. We have people who have worked in tech for years, and people in their first technical role. Some live in the US, and at least one doesn’t. We’re not all men. We’re not all packaging geeks. In short, it’s a good mix. A big part of building this team and culture has been finding ways to keep things fun and to drive learning, even as the organization grows and faces new sets of challenges. I encourage you to take an unorthodox look at encouraging learning, management styles, and the non-technical ideas your teammates are bringing to the table; maybe you’ll find something new to dive into.


[1] If you’re wondering, is backed by a Google spreadsheet.

[2] An excellent talk by Ryan McKern called “Packaging is the Worst Way to Distribute Software, Except for Everything else."


[4] This can help you remember.

Further Learning

December 19, 2014

Day 19 - Infosec Basics: Reason behind Madness

Written by: Jan Schaumann (@jschauma)
Edited by: Ben Cotton (@funnelfiasco)

Sysadmins are a stereotypically grumpy bunch. Oh wait, no, that was infosec people. Or was it infosec sysadmins? The two jobs intersect at the corner of cynicism and experience, and while any senior system administrator worth their salt has all the information security basics down, we still find the two camps at loggerheads all too frequently.

Information Security frequently covers not only the general aspects of applying sound principles, but also the often ridiculed area of “compliance”, where rules too frequently seem blindly imposed without a full understanding of the practical implications or even their effectiveness. To overcome this divide, it is necessary for both camps to better understand one another’s daily routine, practices, and the reasons behind them.

Information Security professionals would do well to reach out and sit with the operational staff for extended periods of time, to work with them and get an understanding of how the performance, stability, and security requirements are imposed and met in the so-called real-world.

Similarly, System Administrators need to understand the reasons behind any requirements imposed or suggested by an organization’s Security team(s). In an attempt to bring the two camps a little bit closer, this post will present some of the general information security principles to show that there’s reason behind what may at times seem madness.

The astute reader will be amused to find occasionally conflicting requirements, statements, or recommendations. It is worthwhile to remember Sturgeon’s Law. (No, not his revelation, although that certainly holds true in information security just as well as in software engineering or internet infrastructure.)

Nothing is always absolutely so.

Understanding this law and knowing when to apply it, being able to decide when an exception to the rules is warranted, is what makes a senior engineer. But before we go making exceptions, let’s first begin by understanding the concepts.

Defense in Depth

Security is like an onion: the more layers you peel away, the more it stinks. Within this analogy lies one of the most fundamental concepts applied over and over to protect your systems, your users and their data: the principle of defense in depth. In simple terms, this means that you must secure your assets against any and all threats – both from the inside (of your organization or network) as well as from the outside. One layer is not enough.

Having a firewall that blocks all traffic from the Big Bad Internet except port 443 does not mean that once you’re on the web server, you should be able to connect to any other system in the network. But this goes further: your organization’s employees connect to your network over a password protected wireless network or perhaps a VPN, but being able to get on the internal network should not grant you access to all other systems, nor to view data flying by across the network. Instead, we want to secure our endpoints and data even against adversaries who already are on a trusted network.

As you will see, defense in depth relates to many of the other concepts we discuss here. For now, keep in mind that you should never rely on a separate protection outside of your control.

Your biggest threat comes from the inside

Internal services are often used by large numbers of internal users; sometimes they need to be available to all internal users. Even experienced system administrators may question why it is necessary to secure and authenticate a resource that is supposed to be available to “everybody”. But defense in depth requires us to, as it hints at an uncomfortable belief held by your infosec colleagues: your organization either already has been compromised and you just don’t know it, or it will be compromised in the very near future. Always assume that the attacker is already on the inside.

While this may seem paranoid, experience has shown time and again that the majority of attacks occur or are aided from within the trusted network. This is necessarily so: attackers can seldom gather all the information or gain all the access required to achieve their goals purely from the outside (DDoS attacks may count as the obligatory exception to this rule – see above re Sturgeon’s Law). Instead, they usually follow a general process in which they first gain access to a system within the network and then elevate their privileges from there.

This is one of the reasons why it is important to secure internal resources to the same degree as services accessible from the outside. Traffic on the internal network should be encrypted in transit to prevent an adversary on your network being able to pull it off the wire (or the airwaves, as the case may be); it should require authentication to confirm (and log) the party accessing the data and deny anonymous use.

This can be inconvenient, especially when you have to secure a service that has been available without authentication and around which other tools have been built. Which brings us to the next point…

You can’t just rub some crypto on it

Once the Genie’s out of the bottle, it’s very, very difficult to get it back in. Granting people access or privileges is easy, taking them away is near impossible. That means that securing an existing service after it has been in use is an uphill battle, and one of the reasons why System Administrators and Information Security engineers need to work closely in the design, development and deployment of any new service.

To many junior operations people, “security” and “encryption” are near equivalent, and using “crypto” (perhaps even: ‘military grade cryptography’!) is seen as robitussin for your systems: rub some on it and walk it off. You’re gonna be fine.

But encryption is only one aspect of (information) security, and it can only help mitigate some threats. Given our desire for defense in depth, we are looking to implement end-to-end encryption of data in transit, but that alone is not sufficient. In order to improve our security posture, we also require authentication and authorization of our services’ consumers (both human and software alike).

Authentication != authorization

Authentication and authorization are two core concepts in information security which are confused or equated all too often. The reason for this is that in many areas the two are practically conflated. Consider, for example, the Unix system: by logging into the system, you are authenticating yourself, proving that you are who you claim to be, for example by offering proof of access to a given private ssh key. Once you are logged in, your actions are authorized, most commonly, by standard Unix access controls: the kernel decides whether or not you are allowed to read a file by looking at the bits in an inode’s st_mode, your uid and your group membership.
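The authorization step described above can be sketched in a few lines of Ruby. This is a deliberately simplified model that covers only the classic owner/group/other mode bits; real kernels also consult ACLs, capabilities, and other mechanisms. The method names `may_read?` and `may_read_path?` are made up for this illustration.

```ruby
# Simplified sketch of the classic Unix read-permission decision.
# Assumption: only the owner/group/other mode bits matter here.
#
# mode:     permission bits as found in st_mode (e.g. 0o640)
# file_uid: uid of the file's owner
# file_gid: gid of the file's owning group
# uid:      uid of the requesting process
# gids:     all group ids the requesting process belongs to
def may_read?(mode, file_uid, file_gid, uid, gids)
  return true if uid.zero?                               # root bypasses mode bits
  return (mode & 0o400) != 0 if uid == file_uid          # owner class only
  return (mode & 0o040) != 0 if gids.include?(file_gid)  # group class only
  (mode & 0o004) != 0                                    # everyone else
end

# Convenience wrapper that applies the same decision to a real path
# for the current process.
def may_read_path?(path)
  st = File.stat(path)
  may_read?(st.mode, st.uid, st.gid,
            Process.euid, Process.groups | [Process.egid])
end
```

Note that exactly one class applies: an owner whose own read bit is cleared is denied even when the group or world bits would allow the read. Authentication never appears in this function; by the time it runs, the uid is simply taken as given.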

Many internal web services, however, perform authentication and authorization (often referred to as “authN” and “authZ” respectively) simultaneously: if you are allowed to log in, you are allowed to use the service. In many cases, this makes sense; however, we should be careful about accepting this as a default. Authentication to a service should, generally, not imply access to all resources therein, yet all too often we transpose this model even to our trusty old Unix systems, where being able to log in implies having access to all world-readable files.

Principle of least privilege

Applying the concept of defense in depth to authorization brings us to the principle of least privilege. As noted above, we want to avoid having authentication imply authorization, and so we need to establish more fine grained access controls. In particular, we want to make sure that every user has exactly the privileges and permissions they require, but no more. This concept spans all systems and all access – it applies equally to human users requiring access to, say, your HR database as well as to system accounts running services, trying to access your user data… and everything in between.
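As a rough illustration of that idea, here is a minimal Ruby sketch of an explicit grant table with a default of "deny". Everything in it (the `GRANTS` table, the principal and resource names, the `authorized?` method) is hypothetical and stands in for whatever policy store an organization actually uses.

```ruby
# Hypothetical grant table: every name below is made up for illustration.
# Each principal is listed with exactly the actions it needs, per resource.
GRANTS = {
  "alice"      => { "hr-db"     => [:read] },
  "svc-report" => { "user-data" => [:read] },
  "bob"        => { "deploy"    => [:restart] },
}.freeze

# Authorization is a separate question from authentication: the caller has
# already proven who they are; here we only decide what they may do.
# Anything not explicitly granted is denied.
def authorized?(principal, resource, action)
  GRANTS.fetch(principal, {}).fetch(resource, []).include?(action)
end
```

With this shape, a successful login for "alice" says nothing about whether she may write to the HR database or read user data; each of those has to be granted on its own.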

Perhaps most importantly (and most directly applicable to system administrators), this precaution to only grant the minimal required access also needs to be considered in the context of super-user privileges, where it demands fine-grained access control lists and/or detailed sudoers(5) rules. It is especially important to clearly define who can do what in environments where more and more developers, site reliability engineers, or operational staff require the ability to deploy, restart, or troubleshoot complex systems.

Extended filesystem Access Control Lists are a surprisingly underutilized tool: a coarse division of privileges by generic groups (“admins”, “all-sudo”, or “wheel”, perhaps) is all too frequently the norm, and sudo(8) privileges are almost always granted in an all-or-nothing approach.

On the flip side, it is important for information security engineers to understand that trying to restrict users in their effort to get their job done is a futile endeavor: users will always find a way around restrictions that get in their way, often times in ways that further compromise overall security (“ssh tunnels” are an immediate red flag here, as they frequently are used to circumvent firewall restrictions and in the process may unintentionally create a backdoor into production systems). Borrowing a bit from the Zen of Python, it is almost always better to explicitly grant permissions than to implicitly assume they are denied (and then find that they are worked around).

Perfect as the enemy of the Good

Information security professionals and System Administrators alike have a tendency to strive for perfect solutions. System Administrators, however, often times have enough practical experience to know that those rarely exist, and that deploying a reasonable, but not perfect, solution that can be iterated upon in the future is almost always preferable.

Herein lies a frequent fallacy however, which many an engineer has derived: if a given restriction can be circumvented, then it is useless. If we cannot secure a resource 100%, then trying to do so is pointless, and may in fact be harmful.

A common scenario might be sudo(8) privileges: many of the commands we may grant developers to run using elevated privileges can be abused or exploited to gain a full root shell (prime example: anything that invokes an editor that allows you to run commands, such as via vi(1)’s “!command” mechanism). Would it not be better to simply grant the user full sudo(8) access to begin with?

Generally: no. The principle of least privilege requires us to be explicit and restrict access where we can. Knowing that the rules in place may be circumvented by a hostile user lets us circle back to the important concept of defense in depth, but we don’t have to make it easier for the attackers. (The audit log provided by requiring specific sudo(8) invocations is another beneficial side-effect.)

We mustn’t let “perfect” be the enemy of the “good” and give up when we cannot solve 100% of the problems. At the same time, though, it is also worth noting that we equally mustn’t let “good enough” become the enemy of the “good”: a half-assed solution that “stops the bleeding” will all too quickly become the new permanent basis for a larger system. As all sysadmins know too well, there is no such thing as a temporary solution.

If these demands seem conflicting to you… you’re right. Striking the right balance here is what is most difficult, and senior engineers of both camps will distinguish themselves by understanding the benefits and drawbacks of either approach.

Understanding your threat model

As we’ve seen above, and as you no doubt will experience yourself, we constantly have to make trade-offs. We want defense in depth, but we do not want to make our systems unusable; we require encryption for data in transit even on trusted systems, because, well, we don’t actually trust these systems; we require authentication and authorization, and desire to have sufficient fine-grained control to abide by the principle of least privilege, yet we can’t let “perfect” be the enemy of the “good”.

Deciding which trade-offs to make, which security mechanisms to employ, and when “good enough” is actually that, and not an excuse to avoid difficult work… all of this, infosec engineers will sing in unison, depends on your threat model.

But defining a “threat model” requires a deep understanding of the systems at hand, which is why System Administrators and their expertise are so valued. We need to be aware of what is being protected from what threat. We need to know who our adversaries are, and what their motivations and capabilities are, before we can determine the methods with which we might mitigate the risks.

Do as DevOps Does

As system administrators, it is important to understand the thought process and concepts behind security requirements. As a by-and-large self-taught profession, we rely on collaboration to learn from others.

As you encounter rules, regulations, demands, or suggestions made by your security team, keep the principles outlined in this post in mind, and then engage them and try to understand not only what exactly they’re asking of you, but also why they’re asking. Make sure to bring your junior staff along, to allow them to pick up these concepts and apply them in the so-called real world, in the process developing solid security habits.

Just like you, your information security colleagues, too, get up every morning and come to work with the desire to do the best job possible, not to ruin your day. Invite them to your team’s meetings; ask them to sit with you and learn about your processes, your users, your requirements.

Do as DevOps does, and ignite the SecOps spark in your organization.

Further reading:

There are far too many details that this already lengthy post could not possibly cover in adequate depth. Consider the following a list of recommended reading for those who want to learn more:

Security through obscurity is terrible; that does not mean that obscurity cannot still provide some (additional) security.

Be aware of the differences between active and passive attacks. Active attacks may be easier to detect, as they are actively changing things in your environment; passive attacks, like wire tapping or traffic analysis, are much harder to detect. These types of attacks have a different threat model.

Don’t assume your tools are not going to be in the critical path.

Another example of why defense in depth is needed is the fact that often times seemingly minor or unimportant issues can be combined to become a critical issue.

The “Attacker Life Cycle”, frequently used within the context of so-called “Advanced Persistent Threats”, may help you understand more completely an adversary’s process, and thus develop your threat model:

This old essay by Bruce Schneier is well worth a read and covers similar ground as this posting. It includes this valuable lesson: When in doubt, fail closed. “When an ATM fails, it shuts down; it doesn’t spew money out its slot.”
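The "fail closed" lesson can be sketched in Ruby as follows. This is only an illustration under stated assumptions: `policy` is a stand-in for whatever backend actually makes the decision, and `allowed?` is a made-up name.

```ruby
# Wraps a policy check so that any error inside the check results in a
# denial rather than in access. `policy` is any callable; it is a
# hypothetical stand-in for a real policy backend.
def allowed?(policy, request)
  # Only an explicit `true` grants access; nil or any other value does not.
  policy.call(request) == true
rescue StandardError
  false # fail closed: a broken check must not grant access
end
```

The design choice is that an unreachable or misbehaving policy backend produces the same result as an explicit "no", much like the ATM that shuts down instead of dispensing cash.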

December 18, 2014

Day 18 - Adding Context to Alerts with nagios-herald

Written by: Katherine Daniels (@beerops)
Edited by: Jennifer Davis (@sigje)

3am Pages Suck!

As sysadmins, we all know the pain that comes from getting paged at 3am because some computery thing somewhere has caught on fire. It’s dark, you were having a perfectly pleasant dream about saving the world from killer robots, or cake, or something, when all of a sudden your phone starts making a noise like a car alarm. It’s your good friend Nagios, disturbing your slumber once again with word of a problem and very little else.

We might hate it for being the bearer of bad news, but Nagios is a well-known and time-tested monitoring and alerting tool. It does its job well: it runs the checks we tell it to, when we tell it to, and it dutifully whines when those checks fail. The problem with its whining, however, is that by default there is very little context around it.

Adding Context to 3am

As an example, let’s take a look at everyone’s favorite thing to get woken up by, the disk space check.

[Image: Disk Space Alert Without Context]

We know that the disk space has just crossed the warning threshold. We know the amount and percentage of free space on this volume. We know what volume is having this issue, and what time the notification was sent. But this doesn’t tell us anything more. Was this volume gradually getting close to the threshold and just happened to go over it during the night? If so, we probably don’t care in the middle of the night - a nice slow increase means that it won’t explode during the night and can be fixed in the morning instead. On the other hand, was there a sudden drastic increase in disk usage? That’s another matter entirely, and something that someone probably should get out of bed for.

This kind of additional context provides really valuable information as to how actionable this alert is. And when we get disk space alerts, one of the first things we do is to check how quickly the disk has been filling up. But in the middle of the night, that’s asking an awful lot - getting out of bed to find a laptop, maybe arguing with a VPN, finding the right graphite or ganglia graph - who wants to do all that when what we really want to do is go back to sleep?
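The manual triage described above boils down to a rate calculation. The Ruby sketch below shows one way to express it, assuming usage samples are already available as pairs of timestamp and bytes used; the method names and the eight-hour paging horizon are hypothetical choices for illustration, not part of Nagios or nagios-herald.

```ruby
# samples: array of [unix_time_seconds, used_bytes], oldest first.
# Returns the average growth in bytes per second across the window.
def fill_rate(samples)
  t0, u0 = samples.first
  t1, u1 = samples.last
  return 0.0 if t1 == t0
  (u1 - u0).to_f / (t1 - t0)
end

# Seconds until the volume is full at the current rate, or nil if usage
# is flat or shrinking.
def seconds_until_full(samples, capacity_bytes)
  rate = fill_rate(samples)
  return nil unless rate.positive?
  _, used = samples.last
  (capacity_bytes - used) / rate
end

# Hypothetical policy: only wake someone up if the disk would fill within
# `horizon` seconds (8 hours by default).
def page_now?(samples, capacity_bytes, horizon: 8 * 3600)
  eta = seconds_until_full(samples, capacity_bytes)
  !eta.nil? && eta < horizon
end
```

A volume that crept up by a few bytes over a day yields an estimate of days and can wait until morning, while a jump of hundreds of units in an hour yields an estimate of minutes and probably deserves a page.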

With nagios-herald, the computers can do all of that work for us.

[Image: Disk Space Alert With Context]

Here we have a bunch of the most relevant context added into the alert for us. We start with a visual indicator of the problematic volume and how full it is, so eyes bleary from sleep can easily grok the severity of the situation. Next is a ganglia graph of the volume over the past day, to give an idea of how fast it has been filling up (and if there was a sudden jump, when it happened, which can often help in tracking down the source of a problem). The threshold is there as well, so we can tell if a critical alert is just barely over the threshold or OH HEY THIS IS REALLY SUPER SERIOUSLY CRITICAL GET UP AND PAY ATTENTION TO IT. Finally, we have alert frequency, to know easily if this is a box that frequently cries wolf or one that might require more attention.

Introducing Formatters

All this is done by way of formatters used by nagios-herald. nagios-herald is itself just a Nagios notification script, but these formatters can be used to do the heavy lifting of adding as much context to an alert as can be dreamt up (or at least automated). The Formatter::Base class defines a variety of methods that make up the core of nagios-herald’s formatting. More information on these methods can be found in their documentation, but to name a few:

  • add_text can be used to add any block of plain text to an alert - this could be used to add information such as which team to contact if this alert fires, whether or not the service is customer-impacting, or anything else that might assist the on-call person who receives the alert.
  • add_html can add any arbitrary HTML - this could be a link to a run-book with more detailed troubleshooting or resolution information, it could add an image (maybe a graph, or just a funny cat picture), or just turn the alert text different colors for added emphasis.
  • ack_info can be used to format information about who acknowledged the alert and when, which can be especially useful on larger or distributed teams where other people might be working on an issue (maybe that lets you know that somebody else is so on top of things that you can go back to sleep and wait until morning!).

All of the methods in the formatter base class can be overridden in any subclass that inherits from it, so the only limit is your imagination. For example, we have several checks that look at graphite graphs and alert (or not) based on their value. Those checks use the check_graphite_graph formatter, which overrides the additional_info base formatter method to add the relevant graph to the Nagios alert:

def additional_info
  section = __method__
  output = get_nagios_var("NAGIOS_#{@state_type}OUTPUT")

  # Plain-text part of the alert: the raw check output.
  add_text(section, "Additional Info:\n #{unescape_text(output)}\n\n") if output

  # HTML part: highlight the current value and thresholds when the output
  # has the expected shape, otherwise fall back to the raw output.
  output_match = output.match(/Current value: (?<current_value>[^,]*), warn threshold: (?<warn_threshold>[^,]*), crit threshold: (?<crit_threshold>[^,]*)/)
  if output_match
    add_html(section, "Current value: <b><font color='red'>#{output_match['current_value']}</font></b>, warn threshold: <b>#{output_match['warn_threshold']}</b>, crit threshold: <b><font color='red'>#{output_match['crit_threshold']}</font></b><br><br>")
  else
    add_html(section, "<b>Additional Info</b>:<br> #{output}<br><br>") if output
  end

  # The graph URL is the last "!"-separated argument of the check command.
  service_check_command = get_nagios_var("NAGIOS_SERVICECHECKCOMMAND")
  url = service_check_command.split(/!/)[-1].gsub(/'/, '')
  graphite_graphs = get_graphite_graphs(url)
  from_match = url.match(/from=(?<from>[^&]*)/)
  if from_match
    add_html(section, "<b>View from '#{from_match['from']}' ago</b><br>")
  else
    add_html(section, "<b>View from the time of the Nagios check</b><br>")
  end
  add_attachment graphite_graphs[0]    # The original graph.
  add_html(section, %Q(<img src="#{graphite_graphs[0]}" alt="graphite_graph" /><br><br>))
  add_html(section, '<b>24-hour View</b><br>')
  add_attachment graphite_graphs[1]    # The 24-hour graph.
  add_html(section, %Q(<img src="#{graphite_graphs[1]}" alt="graphite_graph" /><br><br>))
end

This method calls other methods from the base formatter class, such as add_html and add_attachment, to pull in all the relevant information we wanted for these graphite-based checks.
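To make the override pattern easier to see in isolation, here is a minimal sketch that runs without nagios-herald installed. ToyFormatter and ToyThresholdFormatter are made-up stand-ins, not nagios-herald's real classes; only the method names add_text, add_html and additional_info mirror the ones discussed above, and the way content is stored is an assumption for illustration.

```ruby
# Toy stand-in for a formatter base class; NOT nagios-herald's real implementation.
class ToyFormatter
  attr_reader :content

  def initialize(check_output)
    @check_output = check_output
    # content[:text][section] and content[:html][section] accumulate strings.
    @content = { text: Hash.new(""), html: Hash.new("") }
  end

  def add_text(section, text)
    @content[:text][section] += text
  end

  def add_html(section, html)
    @content[:html][section] += html
  end

  # Default behaviour: pass the raw check output through unchanged.
  def additional_info
    section = __method__
    add_text(section, "Additional Info:\n #{@check_output}\n\n")
    add_html(section, "<b>Additional Info</b>:<br> #{@check_output}<br><br>")
  end
end

# A subclass only overrides the section it wants to enrich.
class ToyThresholdFormatter < ToyFormatter
  PATTERN = /Current value: (?<current_value>[^,]*), warn threshold: (?<warn_threshold>[^,]*), crit threshold: (?<crit_threshold>[^,]*)/

  def additional_info
    section = __method__
    match = @check_output.match(PATTERN)
    return super unless match # unknown output shape: fall back to the base behaviour

    add_text(section, "Additional Info:\n #{@check_output}\n\n")
    add_html(section, "Current value: <b>#{match['current_value']}</b>, " \
                      "crit threshold: <b>#{match['crit_threshold']}</b><br>")
  end
end

formatter = ToyThresholdFormatter.new("Current value: 97.5, warn threshold: 80, crit threshold: 90")
formatter.additional_info
puts formatter.content[:html][:additional_info]
```

The fallback to super is the useful part of the design: a check whose output doesn't match the expected pattern still produces the plain alert rather than an empty one.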

Now What?

If you’re using Nagios and wish its alerts were a little more helpful, go ahead and install nagios-herald and give it a try! From there, you can start customizing your own alerts by writing your own formatters - and we love feedback and pull requests. You’ll have to wrangle some Ruby, but it’s totally worth it for how much more useful your alerts will be. Getting paged in the middle of the night still won’t be particularly fun, but with nagios-herald, at least you can know that the computers are pulling their weight as well. And really, if they’re going to be so demanding and interrupt our sleep, shouldn’t they at least do a little bit of work for us when they do?

December 17, 2014

Day 17 - DevOps for Horses: Moving an Enterprise Application to the Cloud

Written by: Eric Shamow (@eshamow)
Edited by: Michelle Carroll (@miiiiiche)

As an engineer, when you first start thinking about on-demand provisioning, CD, containers, or any of the myriad techniques and technologies floating across the headlines, there is a point when you realize with a cold sweat that this is going to be a bigger job than you thought. As you watch folks talking at various conferences about the way they are deploying and scaling applications, you realize that your applications won’t work if you deploy them this way.

Most of the glamorous or really interesting, thought-provoking discussions around deployment methodologies work because the corresponding applications were built to be deployed into those environments in a true virtuous cycle between development and operations teams. Sometimes the lines between those teams disappear entirely.

In some cases, this is because Operations is outsourced entirely — consider PaaS environments like Heroku or Google App Engine, where applications can be deployed with tremendous ease, due to a very restricted set of conditions defining how code is structured and what features are available. Similarly, on-premises PaaS infrastructures, such as Cloud Foundry or OpenShift, allow for organizations to create a more flexible and customized environment while leveraging the same kind of automation and tight controls around application delivery.

If you can leverage these tools, you should. I advise teams to try and build out an internal PaaS capability — whether they are using Cloud Foundry or bootstrapping their own, or even several to allow for multiple application patterns. The Twelve-Factor App pattern is a good checklist of conditions to start with for understanding what’s necessary to get to a Heroku-like level of automation. If your app meets all these conditions, congratulations — you are probably ready to go PaaS.

My App Isn’t Ready For PaaS

Unless you’re a startup or have a well-funded team effort to move, your application won’t work as it stands in a PaaS. Perhaps you are ready for IaaS (or are evaluating it) and are wondering: where do I start? If you can’t do much with the application design, how can you begin to get ready for a cloud move with the legacy infrastructure and code you have?

Getting Your Bearings

Start by collecting data. A few critical pieces of information I like to gather before drawing up a strategy:

  • What are the components of the application? Can you draw a graph of their dependencies?

  • If the components are separated from one another, can they tolerate the partition or does the app crash or freeze? Are any components a single point of failure?

  • How long does it take for the application to recover from a failure?

  • Can the application recover from a typical failure automatically? If not, what manual intervention is involved?

  • How is the application deployed? If the server on which the application is running dies, what is the process/procedure for bringing it back to life?

  • Can you easily replicate the state of your app in any environment? Are your developers looking at code in an environment that looks as close as possible to production? Can your QA team adequately simulate the conditions of an outage when testing a new release?

  • How do you scale the application? Can you add additional worker systems and scale the system horizontally, or do you need to move the system to bigger and more powerful servers as the service grows?

  • What does the Development/QA cycle look like? Is Operations involved in deploying applications into QA? How long does it take for developers to get a new release into and through the testing cycle?

  • How does operations take delivery of code from development? What is the definition of a deliverable? Is it consistent, or does it change from version to version?

  • How do you know that your application was successfully installed?

I’m not going to tackle all of them, but will rather focus on some of the key themes we’re looking for in examining our apps and environment.
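The first two questions in that list (what depends on what, and where the single points of failure are) lend themselves to a quick script once you have written the dependencies down. The following is a rough sketch only: the component names are invented, and counting dependents is just one crude way to rank risk.

```ruby
require "set"

# Hypothetical component map: each key depends directly on the listed components.
DEPENDENCIES = {
  "web"      => ["api"],
  "api"      => ["database", "cache"],
  "worker"   => ["queue", "database"],
  "queue"    => [],
  "cache"    => [],
  "database" => []
}.freeze

# Every component reachable from `name`, following dependencies transitively.
# The `seen` set also protects against cycles in the graph.
def transitive_dependencies(graph, name, seen = Set.new)
  graph.fetch(name, []).each do |dep|
    next if seen.include?(dep)

    seen << dep
    transitive_dependencies(graph, dep, seen)
  end
  seen
end

# For each component, count how many others depend on it directly or indirectly.
# Returns [name, count] pairs, most depended-upon first.
def blast_radius(graph)
  counts = Hash.new(0)
  graph.each_key do |component|
    transitive_dependencies(graph, component).each { |dep| counts[dep] += 1 }
  end
  counts.sort_by { |name, count| [-count, name] }
end

blast_radius(DEPENDENCIES).each do |name, count|
  puts "#{name}: #{count} dependent component(s)"
end
```

Components near the top of the output are the ones whose failure touches the most of the application, which makes them good candidates for the closer look described below.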


One of the key underpinnings of modern application design is the understanding that failure is inevitable — it’s not a question of if a component of your application will fail, but when. The critical metric for an application is not necessarily how often it fails (although an app that fails regularly is clearly a problem) but how well its components tolerate the failure of other components. As your app scales out — and particularly if you are planning to move to public cloud — you can expect that data will no longer flow evenly between components. This is not just a problem of high latency, but variable latency — sudden network congestion can cause traffic between components to be bursty.

If one component of your application depends on another component to be functional, or your app requires synchronous and low-latency communication at all times between components, you have identified tight coupling. These tight couplings are death for applications in the cloud (and they’re the services that make upgrades and migration to new locations the most difficult as well). Tight couplings are amongst the most difficult problems to address — often they relate to application design and are tightly tied to the business logic and implementation of the application. A good overview of the problem and some potential remedies can be found in Martin Fowler’s 2001 article “Reducing Coupling” (warning: PDF).

For now, we need to identify these tight couplings and pay extra attention to them: monitor heavily around communications, add checks to ensure that data is flowing smoothly, and in general treat these parts of our architecture as the fragile breakpoints that they are. If you cannot work around or eliminate these couplings, you may be able to automate processes for detection and remediation. Ultimately, the couplings between your apps will determine your pattern for upgrades, migrations and scaling, so understanding how your components communicate and which depend on each other is essential to building a working and automated process.
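One small way to start detecting a struggling coupling is to put a time limit around the call and record what happened, instead of letting the caller hang. This is a sketch under assumptions: call_dependency and CallResult are hypothetical names, the blocks stand in for real remote calls, and Ruby's Timeout module is used for brevity even though a client library's own timeout settings are usually the better tool.

```ruby
require "timeout"

CallResult = Struct.new(:status, :value)

# Run the block, but give up after `limit` seconds and return `fallback` instead.
def call_dependency(limit:, fallback:)
  value = Timeout.timeout(limit) { yield }
  CallResult.new(:ok, value)
rescue Timeout::Error
  CallResult.new(:timed_out, fallback)
rescue StandardError
  CallResult.new(:failed, fallback)
end

# The blocks below simulate a healthy dependency and a congested one.
fast = call_dependency(limit: 0.5, fallback: []) { ["row 1", "row 2"] }
slow = call_dependency(limit: 0.05, fallback: []) { sleep 1; ["never returned"] }
puts "fast: #{fast.status}, slow: #{slow.status}"
```

Counting the :timed_out and :failed results over time gives you a simple signal to monitor and alert on for each coupling.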


If you can’t reinstall the app without human intervention, you have a problem. We can expect that a server will eventually fail and that application updates will happen on a regular cadence. Humans screw up things we do repetitively — repeat even a simple process often enough and you will eventually do it wrong. Computers are exceptionally good at repetitive tasks. If you have your sysadmins doing regular installs of your applications — or worse if your sysadmins have to call in developers and they must pair to slowly work through every install — you are not taking advantage of the computers. And you’re overtaxing humans who are much better at — and happier — doing other things.

Many organizations maintain either an installation wiki, a set of install scripts, or both. These sources of information frequently vary and operators need to hop from one to the other to assemble and install. With this type of ad-hoc assembly of a process, it’s likely that one administrator will not follow the process perfectly each time, but certain that different administrators will follow the process in different ways. Asking people to “fix the wiki” will not fix the discrepancy. The wiki will always lag the current state of your systems. Instead, treat your installation scripts like “executable documentation.” They should be the single source of truth for the process used to deploy the app.

While you will want your automation to use good, known frameworks, the reality is that a BASH script is a good start if you have nothing in place. Is BASH the way to go for your system automation? As a former employer put it, “SSH in a for loop isn’t enough” — and it’s not. But writing a script to deploy a system in a language you already know is a good way to identify if you can automate the deploy, as well as the decisions you need to make during the install. This information informs your later choice of automation framework, and enables you to identify which parts of your configuration change from install to install. As a bonus, you’ve taken a first pass at automating your process, which will speed up your deploys and help you select an automation framework that best fits your use case. For an exploration of this topic and an introduction to taking it a step further into early Configuration Management, check out my former colleague Mike Stahnke’s dead-on 2013 presentation “Getting Started With Puppet.”
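As an illustration of “executable documentation,” here is a minimal sketch of an install script whose steps are a readable list that also runs. Everything in STEPS is a placeholder: the step names and echo commands are assumptions standing in for whatever your wiki currently tells a human to type.

```ruby
require "open3"

# Hypothetical install steps; each command is a placeholder for a real one.
STEPS = [
  { name: "Check runtime is present", command: ["ruby", "--version"] },
  { name: "Unpack release",           command: ["echo", "unpacking release"] },
  { name: "Restart service",          command: ["echo", "restarting service"] }
].freeze

# Run steps in order and stop at the first failure.
# With dry_run: true, only print what would be executed.
def run_steps(steps, dry_run: false)
  steps.each_with_index do |step, index|
    label = "[#{index + 1}/#{steps.size}] #{step[:name]}"
    if dry_run
      puts "#{label} (would run: #{step[:command].join(' ')})"
      next
    end

    output, status = Open3.capture2e(*step[:command])
    puts "#{label}: #{status.success? ? 'ok' : 'FAILED'}"
    unless status.success?
      warn output
      return false
    end
  end
  true
end

run_steps(STEPS, dry_run: true)
```

The dry run doubles as the documentation: it prints the same ordered list a wiki page would, but it can never drift from what actually gets executed.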

Environment Parity and Configuration Management

We’ve all been on some side of the environment parity issue. Code makes it into production that didn’t take into account some critical element of the production environment — a firewall, different networking configuration, different system version, and so on. The invariable response from Operations is, “Developers don’t understand real operating environments.” The colloquial version of this is, “It works on my laptop!”

The more common truth is that Operations didn’t provide Development with an environment that looked anything like production, or even with the tools to know about or understand what the production environment looks like. As an Operations team, if you don’t offer Development a prod-like environment to deploy into and test with, you cede your right to complain about code they produce that doesn’t match prod.

Since it is often not possible to give developers an exact copy of production, it’s important for the Operations team to abstract away as many changes between environments as is possible. Dev, QA, Prod and all other environments should be running the same OS versions and patch sets, with the same dependencies and same system configuration across the board. The most sensible way to do this is with Configuration Management. Configure all of your environments using the same tools and, most critically, with the same configuration management scripts. The differences between your environments should be a set of variables that inform that code.

If you can’t reduce the differences between your environments to code informed by variables, you’ve identified some hard problems your developers and operations teams are going to have to bridge together. At the very least, if you can make your environments more similar, you can significantly reduce the number of factors that must be taken into account when an app fails in one environment when it succeeded in another.
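A tiny sketch of “the same code, informed by variables” using nothing but Ruby's standard ERB templates. The setting names, hostnames and values are invented for illustration; a real configuration management tool gives you the same idea with far more machinery.

```ruby
require "erb"

# One template shared by every environment.
TEMPLATE = ERB.new(<<~CONF)
  listen_port = <%= port %>
  db_host     = <%= db_host %>
  log_level   = <%= log_level %>
CONF

# Hypothetical per-environment variables: the only thing allowed to differ.
ENVIRONMENTS = {
  "dev"  => { port: 8080, db_host: "db.dev.example.internal",  log_level: "debug" },
  "qa"   => { port: 8080, db_host: "db.qa.example.internal",   log_level: "info" },
  "prod" => { port: 8080, db_host: "db.prod.example.internal", log_level: "warn" }
}.freeze

# Render the shared template with one environment's variables.
# Raises KeyError for an environment that has not been defined.
def render_config(environment)
  TEMPLATE.result_with_hash(ENVIRONMENTS.fetch(environment))
end

puts render_config("qa")
```

If a difference between environments can't be expressed as another entry in that hash, it is one of the hard problems described above and worth writing down explicitly.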

Get Operations out of the Dev/QA Cycle

The notion of Operations being required to install applications into a QA/Testing environment always baffled me. I was in favor of Development not doing the install themselves, but I also understood that opening a ticket with Operations and waiting for an install is a time-intensive process, and that debugging/troubleshooting is a highly interactive one. These two needs are at odds. By slowing down the Dev/QA feedback loop, Operations not only causes Development to become less efficient, it also encourages developers to do larger chunks of work and submit them for testing less frequently.

The flip side of this is that allowing developers full root access on QA servers is potentially dangerous. Developers may inadvertently make changes that cause the QA servers to behave differently from production. Similarly, if developers are installing directly into QA, operations doesn’t get to look at the deployable until it reaches production. When they install the application for the first time, it’s in the most critical environment.

There’s a three-part fix for this:

  • Developers are responsible for deliverables in a consistent format. Whether that’s a package, a tarball, or a tagged git checkout, the deliverable must look the same from release to release.

  • QA is managed via Configuration Management, and applications are installed into QA using the same automation tools/scripts used in production.

  • Operations’ SLA for QA is that it will flatten and re-provision the environment when needed. If a deployment screws up the server, Ops will provide a new, clean server.

Using these policies, the application is installed into QA and any subsequent environments with the same scripts. If we’ve learned anything from the Lean movement, it’s that accuracy can be improved by reducing batch sizes, increasing the speed of processing and baking QA into the process. With these changes, the deployment scripts and artifacts are tested dozens, hundreds or thousands of times before they are ever used in production. This can help find deployment problems and iron out scripts long before code ever reaches user-facing systems.

The benefits for both teams are clear: Development gets a fast turnaround time for QA, Operations gets a clean deliverable that can be deployed via its own scripts.

Functional Testing

While there will always be the need for manual testing of certain functionality, establishing an automated testing regimen can provide quick feedback about whether an app is functioning as intended.

While an overview of testing strategies is beyond the scope of this article (Chapter 4 of Jez Humble and David Farley’s book Continuous Delivery provides an excellent overview), I’d argue for prioritizing a combination of functional and integration tests. You want to confirm that the app does what is intended. Simple smoke tests to verify that a server is configured properly and that an application is installed and running are a good first pass at a testing regimen.
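A first smoke test can be very small. The sketch below uses only Ruby's standard library; the port number and the config file path are assumptions standing in for whatever “installed and running” means for your application, and port_open? and run_smoke_checks are hypothetical helper names.

```ruby
require "socket"
require "timeout"

# True if something accepts TCP connections on host:port within `limit` seconds.
def port_open?(host, port, limit: 2)
  Timeout.timeout(limit) do
    TCPSocket.new(host, port).close
    true
  end
rescue Timeout::Error, SystemCallError, SocketError
  false
end

# Hypothetical smoke checks for a freshly deployed server.
CHECKS = {
  "web server listening" => -> { port_open?("127.0.0.1", 80) },
  "app config present"   => -> { File.exist?("/etc/myapp/config.yml") }
}.freeze

# Run every check, print PASS/FAIL per check, and return true only if all passed.
def run_smoke_checks(checks)
  results = checks.transform_values(&:call)
  results.each { |name, passed| puts "#{passed ? 'PASS' : 'FAIL'}  #{name}" }
  results.values.all?
end
```

Calling run_smoke_checks(CHECKS) at the end of a deploy and failing the deploy when it returns false is enough to catch the “service never came back up” class of problem.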

Once you get comfortable writing tests, you should begin doing more involved testing of application and server behavior and performance. Every time you make a change that alters the behavior of the application or underlying system, add a test. Down the line you may want to consider TDD or BDD, but start small — having imperfect tests is better than having no tests at all.

At the application level, your development team likely has a testing language or suite for unit and integration tests. There are a number of frameworks you can use for doing this at the server/Configuration Management level. I have used both serverspec and Beaker with success in the past.

The first time you run a proposed configuration management change through tests and discover that it would break your application is a revelation. Similarly, the first time you prevent a regression by adding a check for something that “always” breaks will be the last time somebody accidentally breaks it.

Wrapping Up

We’ve just scratched the surface of what can be done with an existing environment, but as you can hopefully see, there’s plenty you can do right now to get your environment ready for IaaS (and eventually, PaaS) without touching your application’s code.

Remember that this process should be iterative — unless you have the budget to build a greenfield environment tomorrow, you are going to be tackling this one piece at a time. Don’t feel ashamed because your environments aren’t automated enough or you don’t have comprehensive enough tests for your application. Rather, focus on making things better. If you don’t have enough automation, build more. If there aren’t enough good tests, write just one. Then re-examine your environment, see what most needs improvement, and iterate there.

There’s no way to completely move an app without touching the code, but there’s plenty of work to do before you get there in preparation of scalable, loosely coupled code. Don’t wait for the perfect application to start doing the right thing.

December 16, 2014

Day 16 - How to Interview Systems Administrators

Written by: Corey Quinn (@quinnypig)
Edited by: Justin Garrison (@rothgar)

There are many blog posts, articles, and even books[0] written on how to effectively interview software engineers. Hiring systems administrators[1] is a bit more prickly of a topic, for a few reasons.

  • You generally hire fewer of them than you do developers[2].
  • A systems administrator likely has root in production. Mistakes will show more readily, and in many environments “peer review” is an aspiration rather than the current state of things.
  • It’s extremely easy to let your systems administration team become “the department of no.” This can have an echo effect that pumps toxicity into your organization. It’s important to hire someone who isn’t going to add overwhelming negativity.

Every job interview since the beginning of time is built around asking candidates three questions. They’ll take different forms, and you’ll dress them up differently each time, but they can be distilled down as follows.

  1. Can you do the job?
  2. Will you like doing the job?
  3. Can we stand working with you?

Doing the Job

This is where the barrage of technical questions comes in. Be careful when selecting what technical areas you want to cover, and how you cover them. Going into stupendous depth on SAN management when you don’t have centralized storage at all is something of a waste of time.

Additionally, many shops equate trivia with mastery of a subject. “Which format specifier to date(1) will spit out the seconds since the Unix epoch began?” The correct answer is of course “man date” unless they, for some reason, have %s memorized– but what does a right answer really tell you past a single bit of data? Being able to successfully memorize trivia doesn’t really speak to someone’s ability to successfully perform in an operational role.

Instead, it probably makes more sense for you to ask open ended questions about things you care about. “So, we have a lot of web servers here. What’s your experience with managing them? What other technology have you worked with in conjunction with serving data over http/https?” This gleans a lot more data than asking trivia questions about configuring virtual hosts in Apache’s httpd. Be aware that some folks will try to talk around the question; politely returning to specific scenarios can help refocus them.

Liking the Job

Hiring people, training them, and the rest of the onboarding process are expensive. Having to replace someone who left due to poor fit, a skills mismatch, or other reasons two months into the job is awful. It’s important to suss out whether or not the candidate is likely to enjoy their work. That said, it’s sometimes difficult to ascertain whether or not the candidate is just telling you what you want to hear. To that end, ask the candidate for specific stories regarding their current and past work. “Tell me about a time you had to deal with a difficult situation.” Push for specific details– you don’t want to hear “the right answer,” you want to know what actually happened.

This questioning technique leads well into the third question…

Not Being a Jerk

If you think back across your career, you can probably think of a systems administrator you’ve met who could easily be named Surly McBastard. You really, really, really don’t want to hire that person. It’s very easy for the sysadmin group to gain the reputation as “the department of no” just due to their job function alone– remember, their goal is stability above all else. Your engineering group (presuming a separate and distinct team from the operations group) is trying to roll new features out. This gives way to a natural tension in most organizations. There’s no need to exacerbate this by hiring someone who’s difficult to work with.

A key indicator here is fanaticism. We all have our favorite pet technologies, but most of us are able to put personal preferences aside in favor of the prevailing consensus. A subset of technologists are unable to do this. “You use Redis? Why?! It’s a steaming pile of crap!” is a great example of what you might not want to hear. A better way for a candidate to frame this sentiment might be “Oh, you’re a Redis shop? That’s interesting– I’ve run into some challenges with it in the past. I’d be very curious to hear how you’ve overcome some challenges…”

Remember, the successful candidate is going to have to deal with other groups of people, and that’s a very challenging thing to interview for. It also helps to remember that interviewing is an inexact science, and everyone approaches it with a number of biases.

For this reason, I strongly recommend having multiple interviewers speak to each candidate, and then compare notes afterwards. It’s entirely possible that one person will pick up on a red flag that others will miss.

Ultimately, interviewing is a challenge on both sides of the table. The best way to improve is to practice– take notes on what works, what doesn’t, and adjust accordingly. Remember that every hire you make shifts your team; ideally you want that to be trending upwards with each successive hire.

[0] I’m partial to myself.
[1] For purposes of this article, “systems administrators” can be expanded to include operations engineers, devops unicorns, network engineers, database wizards, storage gurus, infrastructure perverts, NOC technicians, and other similar roles.
[2] For purposes of this article, “developers” can be expanded to include… you get the idea.

December 15, 2014

Day 15 - Cook your own packages: Getting more out of fpm

Written by: Mathias Lafeldt (@mlafeldt)
Edited by: Joseph Kern (@josephkern)


When it comes to building packages, there is one particular tool that has grown in popularity over the last few years: fpm. fpm’s honorable goal is to make it as simple as possible to create native packages for multiple platforms, all without having to learn the intricacies of each distribution’s packaging format (.deb, .rpm, etc.) and tooling.

With a single command, fpm can build packages from a variety of sources including Ruby gems, Python modules, tarballs, and plain directories. Here’s a quick example showing you how to use the tool to create a Debian package of the AWS SDK for Ruby:

$ fpm -s gem -t deb aws-sdk
Created package {:path=>"rubygem-aws-sdk_1.59.0_all.deb"}

It is this simplicity that makes fpm so popular. Developers are able to easily distribute their software via platform-native packages. Businesses can manage their infrastructure on their own terms, independent of upstream vendors and their policies. All of this has been possible before, but never with this little effort.

In practice, however, things are often more complicated than the one-liner shown above. While it is absolutely possible to provision production systems with packages created by fpm, it will take some work to get there. The tool can only help you so far.

In this post we’ll take a look at several best practices covering: dependency resolution, reproducible builds, and infrastructure as code. All examples will be specific to Debian and Ruby, but the same lessons apply to other platforms/languages as well.

Resolving dependencies

Let’s get back to the AWS SDK package from the introduction. With a single command, fpm converts the aws-sdk Ruby gem to a Debian package named rubygem-aws-sdk. This is what happens when we actually try to install the package on a Debian system:

$ sudo dpkg --install rubygem-aws-sdk_1.59.0_all.deb
dpkg: dependency problems prevent configuration of rubygem-aws-sdk:
 rubygem-aws-sdk depends on rubygem-aws-sdk-v1 (= 1.59.0); however:
  Package rubygem-aws-sdk-v1 is not installed.

As we can see, our package can’t be installed due to a missing dependency (rubygem-aws-sdk-v1). Let’s take a closer look at the generated .deb file:

$ dpkg --info rubygem-aws-sdk_1.59.0_all.deb
 Package: rubygem-aws-sdk
 Version: 1.59.0
 License: Apache 2.0
 Vendor: Amazon Web Services
 Architecture: all
 Maintainer: <vagrant@wheezy-buildbox>
 Installed-Size: 5
 Depends: rubygem-aws-sdk-v1 (= 1.59.0)
 Provides: rubygem-aws-sdk
 Section: Languages/Development/Ruby
 Priority: extra
 Description: Version 1 of the AWS SDK for Ruby. Available as both `aws-sdk` and `aws-sdk-v1`.
  Use `aws-sdk-v1` if you want to load v1 and v2 of the Ruby SDK in the same

fpm did a great job at populating metadata fields such as package name, version, license, and description. It also made sure that the Depends field contains all required dependencies that have to be installed for our package to work properly. Here, there’s only one direct dependency – the one we’re missing.

While fpm goes to great lengths to provide proper dependency information – and this is not limited to Ruby gems – it does not automatically build those dependencies. That’s our job. We need to find a set of compatible dependencies and then tell fpm to build them for us.

Let’s build the missing rubygem-aws-sdk-v1 package with the exact version required and then observe the next dependency in the chain:

$ fpm -s gem -t deb -v 1.59.0 aws-sdk-v1
Created package {:path=>"rubygem-aws-sdk-v1_1.59.0_all.deb"}

$ dpkg --info rubygem-aws-sdk-v1_1.59.0_all.deb | grep Depends
 Depends: rubygem-nokogiri (>= 1.4.4), rubygem-json (>= 1.4), rubygem-json (<< 2.0)

Two more packages to take care of: rubygem-nokogiri and rubygem-json. By now, it should be clear that resolving package dependencies like this is no fun. There must be a better way.
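To see why doing this by hand gets tedious, it helps to look at what we are actually checking at each step: does a candidate gem version satisfy every constraint in the Depends field? Here is a rough sketch using RubyGems' own Gem::Requirement. The helper names are made up, and treating Debian version constraints as gem version constraints is an approximation that only holds for simple version strings like the ones above.

```ruby
require "rubygems"

# Parse a Debian Depends line into { "package" => ["op version", ...] }.
def parse_depends(line)
  deps = Hash.new { |hash, key| hash[key] = [] }
  line.sub(/\A\s*Depends:\s*/, "").split(",").each do |entry|
    match = entry.strip.match(/\A(?<name>\S+)(?:\s*\((?<constraint>[^)]+)\))?\z/)
    next unless match

    deps[match[:name]] # register the package even when it has no constraint
    deps[match[:name]] << match[:constraint] if match[:constraint]
  end
  deps
end

# Debian writes strict comparisons as << and >>; RubyGems uses < and >.
def to_gem_requirement(constraints)
  Gem::Requirement.new(constraints.map { |c| c.sub("<<", "<").sub(">>", ">") })
end

# True if `version` satisfies every constraint in the list.
def satisfies?(constraints, version)
  to_gem_requirement(constraints).satisfied_by?(Gem::Version.new(version))
end

depends = parse_depends(" Depends: rubygem-nokogiri (>= 1.4.4), rubygem-json (>= 1.4), rubygem-json (<< 2.0)")
depends.each { |name, constraints| puts "#{name}: #{constraints.join(', ')}" }
puts satisfies?(depends["rubygem-json"], "1.8.1")
puts satisfies?(depends["rubygem-json"], "2.0.0")
```

This only checks one level of the chain. Repeating it for every dependency of every dependency is exactly the work a real resolver exists to do.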

In the Ruby world, Bundler is the tool of choice for managing and resolving gem dependencies. So let’s ask Bundler for the dependencies we need. For this, we create a Gemfile with the following content:

# Gemfile
source ""
gem "aws-sdk", "= 1.59.0"
gem "nokogiri", "~> 1.5.0" # use older version of Nokogiri

We then instruct Bundler to resolve all dependencies and store the resulting .gem files into a local folder:

$ bundle package
Updating files in vendor/cache
  * json-1.8.1.gem
  * nokogiri-1.5.11.gem
  * aws-sdk-v1-1.59.0.gem
  * aws-sdk-1.59.0.gem

We specifically asked Bundler to create .gem files because fpm can convert them into Debian packages in a matter of seconds:

$ find vendor/cache -name '*.gem' | xargs -n1 fpm -s gem -t deb
Created package {:path=>"rubygem-aws-sdk-v1_1.59.0_all.deb"}
Created package {:path=>"rubygem-aws-sdk_1.59.0_all.deb"}
Created package {:path=>"rubygem-json_1.8.1_amd64.deb"}
Created package {:path=>"rubygem-nokogiri_1.5.11_amd64.deb"}

As a final test, let’s install those packages…

$ sudo dpkg -i *.deb
Setting up rubygem-json (1.8.1) ...
Setting up rubygem-nokogiri (1.5.11) ...
Setting up rubygem-aws-sdk-v1 (1.59.0) ...
Setting up rubygem-aws-sdk (1.59.0) ...

…and verify that the AWS SDK actually can be used by Ruby:

$ ruby -e "require 'aws-sdk'; puts AWS::VERSION"


The purpose of this little exercise was to demonstrate one effective approach to resolving package dependencies for fpm. By using Bundler – the best tool for the job – we get fine control over all dependencies, including transitive ones (like Nokogiri, see Gemfile). Other languages provide similar dependency tools. We should make use of language specific tools whenever we can.
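If you would rather drive the conversion step from Ruby than from find and xargs, for example to stop at the first failed build, a sketch along these lines is one option. It assumes fpm is on the PATH and only uses the -s gem and -t deb flags shown earlier; fpm_commands and build_packages are hypothetical helper names.

```ruby
# Build one fpm command per .gem file; returns argument arrays and runs nothing.
def fpm_commands(gem_paths, target: "deb")
  gem_paths.sort.map { |path| ["fpm", "-s", "gem", "-t", target, path] }
end

# Run each command in turn, stopping at the first one that fails.
# Returns true only if every package was built (or there was nothing to build).
def build_packages(gem_dir = "vendor/cache")
  fpm_commands(Dir.glob(File.join(gem_dir, "*.gem"))).all? do |command|
    puts command.join(" ")
    system(*command)
  end
end
```

Keeping the command construction separate from the execution makes it easy to print or review what would be built before anything runs.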

Build infrastructure

After learning how to build all packages that make up a piece of software, let’s consider how to integrate fpm into our build infrastructure. These days, with the rise of the DevOps movement, many teams have started to manage their own infrastructure. Even though each team is likely to have unique requirements, it still makes sense to share a company-wide build infrastructure, as opposed to reinventing the wheel each time someone wants to automate packaging.

Packaging is often only a small step in a longer series of build steps. In many cases, we first have to build the software itself. While fpm supports multiple source formats, it doesn’t know how to build the source code or determine dependencies required by the package. Again, that’s our job.

Creating a consistent build and release process for different projects across multiple teams is hard. Fortunately, there’s another tool that does most of the work for us: fpm-cookery. fpm-cookery sits on top of fpm and provides the missing pieces to create a reusable build infrastructure. Inspired by projects like Homebrew, fpm-cookery builds packages based on simple recipes written in Ruby.

Let’s turn our attention back to the AWS SDK. Remember how we initially converted the gem to a Debian package? As a warm up, let’s do the same with fpm-cookery. First, we have to create a recipe.rb file:

# recipe.rb
class AwsSdkGem < FPM::Cookery::RubyGemRecipe
  name    "aws-sdk"
  version "1.59.0"
end

Next, we pass the recipe to fpm-cook, the command-line tool that comes with fpm-cookery, and let it build the package for us:

$ fpm-cook package recipe.rb
===> Starting package creation for aws-sdk-1.59.0 (debian, deb)
===> Verifying build_depends and depends with Puppet
===> All build_depends and depends packages installed
===> [FPM] Trying to download {"gem":"aws-sdk","version":"1.59.0"}
===> Created package: /home/vagrant/pkg/rubygem-aws-sdk_1.59.0_all.deb

To complete the exercise, we also need to write a recipe for each remaining gem dependency. This is what the final recipes look like:

# recipe.rb
class AwsSdkGem < FPM::Cookery::RubyGemRecipe
  name       "aws-sdk"
  version    "1.59.0"
  maintainer "Mathias Lafeldt <>"

  chain_package true
  chain_recipes ["aws-sdk-v1", "json", "nokogiri"]
end

# aws-sdk-v1.rb
class AwsSdkV1Gem < FPM::Cookery::RubyGemRecipe
  name       "aws-sdk-v1"
  version    "1.59.0"
  maintainer "Mathias Lafeldt <>"
end

# json.rb
class JsonGem < FPM::Cookery::RubyGemRecipe
  name       "json"
  version    "1.8.1"
  maintainer "Mathias Lafeldt <>"
end

# nokogiri.rb
class NokogiriGem < FPM::Cookery::RubyGemRecipe
  name       "nokogiri"
  version    "1.5.11"
  maintainer "Mathias Lafeldt <>"

  build_depends ["libxml2-dev", "libxslt1-dev"]
  depends       ["libxml2", "libxslt1.1"]
end

Running fpm-cook again will produce Debian packages that can be added to an APT repository and are ready for use in production.

Three things worth highlighting:

  • fpm-cookery is able to build multiple dependent packages in a row (configured by chain_* attributes), allowing us to build everything with a single invocation of fpm-cook.
  • We can use the attributes build_depends and depends to specify a package’s build and runtime dependencies. When running fpm-cook as root, the tool will automatically install missing dependencies for us.
  • I deliberately set the maintainer attribute in all recipes. It’s important to take responsibility for the work that we do. We should make it as easy as possible for others to identify the person or team responsible for a package.

fpm-cookery provides many more attributes to configure all aspects of the build process. Among other things, it can download source code from GitHub before running custom build instructions (e.g. make install). The fpm-recipes repository is an excellent place to study some working examples. This final example, a recipe for chruby, is a foretaste of what fpm-cookery can actually do:

# recipe.rb
class Chruby < FPM::Cookery::Recipe
  description "Changes the current Ruby"

  name     "chruby"
  version  "0.3.8"
  homepage ""
  source   "{version}.tar.gz"
  sha256   "d980872cf2cd047bc9dba78c4b72684c046e246c0fca5ea6509cae7b1ada63be"

  maintainer "Jan Brauer <>"

  section "development"

  config_files "/etc/profile.d/"

  def build
    # nothing to do here
  end

  def install
    make :install, "PREFIX" => prefix
    etc("profile.d").install workdir("")
  end
end

source /usr/share/chruby/

Wrapping up

fpm has changed the way we build packages. We can get even more out of fpm by using it in combination with other tools. Dedicated programs like Bundler can help us with resolving package dependencies, which is something fpm won’t do for us. fpm-cookery adds another missing piece: it allows us to describe our packages using simple recipes, which can be kept under version control, giving us the benefits of infrastructure as code: repeatability, automation, rollbacks, code reviews, etc.

Last but not least, it’s a good idea to pair fpm-cookery with Docker or Vagrant for fast, isolated package builds. This, however, is outside the scope of this article and left as an exercise for the reader.

Further reading

December 14, 2014

Day 14 - Using Chef Provisioning to Build Chef Server

Or, Yo Dawg, I heard you like Chef.

Written by: Joshua Timberman (@jtimberman)
Edited by: Paul Graydon (@twirrim)

This post is dedicated to Ezra Zygmuntowicz. Without Ezra, we wouldn’t have had Merb for the original Chef server, chef-solo, and maybe not even Chef itself. His contributions to the Ruby, Rails, and Chef communities are immense. Thanks, Ezra, RIP.

In this post, I will walk through a use case for Chef Provisioning used at Chef Software, Inc.: building a new Hosted Chef infrastructure with Chef Server 12 on Amazon EC2. This isn’t an in-depth how-to guide, but I will illustrate the important components and discuss what is required to set up Chef Provisioning, with a real-world example. Think of it as a whirlwind tour of Chef Provisioning and Chef Server 12.


If you have used Chef for a while, you may recall the wiki page “Bootstrap Chef RubyGems Installation” - the installation guide that uses cookbooks with chef-solo to install all the components required to run an open source Chef Server. This idea was a natural fit in the omnibus packages for Enterprise Chef (née Private Chef) in the form of private-chef-ctl reconfigure: that command kicks off a chef-solo run that configures and starts all the Chef Server services.

It should be no surprise that at CHEF we build Hosted Chef using Chef. Yes, it’s turtles and yo-dawg jokes all the way down. As the CHEF CTO Adam described when talking about one Chef Server codebase, we want to bring our internal deployment and development practices in line with what we’re shipping to customers, and we want to unify our approach so we can provide better support.

Chef Server 12

As announced recently, Chef Server 12 is generally available. For purposes of the example discussed below, we’ll provision three machines: one backend, one frontend (with Chef Manage and Chef Reporting), and one running Chef Analytics. While Chef Server 12 has the capability to install add-ons, we have a special cookbook with a resource to manage the installation of “Chef Server Ingredients.” This is so we can also install the chef-server-core package used by both the API frontend nodes and the backend nodes.

Chef Provisioning

Chef Provisioning is a new capability for Chef, where users can define “machines” as Chef resources in recipes, and then converge those recipes on a node. This means that new machines are created using a variety of possible providers (AWS, OpenStack, or Docker, to name a few), and they can have recipes applied from other cookbooks available on the Chef Server.

Chef Provisioning “runs” on a provisioner node. This is often a local workstation, but it could be a specially designated node in a data center or cloud provider. It is simply a recipe run by chef-client (or chef-solo). When using chef-client, any Chef Server will do, including Hosted Chef. Of course, the idea here is we don’t have a Chef Server yet. In my examples in this post, I’ll use my OS X laptop as the provisioner, and Chef Zero as the server.

Assemble the Pieces

The cookbook that does the work using Chef Provisioning is chef-server-cluster. Note that this cookbook is under active development, and the code it contains may differ from the code in this post. As such, I’ll post relevant portions to show the use of Chef Provisioning, and the supporting local setup required to make it go. Refer to the cookbook itself for the most recent information on how to use it.

Amazon Web Services EC2

The first thing we need is an AWS account for the EC2 instances. Once we have that, we need an IAM user that has privileges to manage EC2, and an SSH keypair to log into the instances. It is outside the scope of this post to provide details on how to assemble those pieces. However, once those are acquired, do the following:

Put the access key and secret access key configuration in ~/.aws/config. This is automatically used by chef-provisioning’s AWS provider. The SSH keys will be used in a data bag item (JSON) that is described later. You will then want to choose an AWS region to use. For the sake of example, my keypair is named hc-metal-provisioner in the us-west-2 region.

Chef Provisioning needs to know about the SSH keys in three places:

  1. In the .chef/knife.rb, the private_keys and public_keys configuration settings.
  2. In the machine_options that is used to configure the (AWS) driver so it can connect to the machine instances.
  3. In a recipe.

This is described in more detail below.

Chef Repository

We use a Chef Repository to store all the pieces and parts for the Hosted Chef infrastructure. For example purposes I’ll use a brand new repository. I’ll use ChefDK’s chef generate command:

% chef generate repo sysadvent-chef-cluster

This repository will have a Policyfile.rb, a .chef/knife.rb config file, and a couple of data bags. The latest implementation specifics can be found in the chef-server-cluster cookbook.

Chef Zero and Knife Config

As mentioned above, Chef Zero will be the Chef Server for this example, and it will run on a specific port (7799). I started it up in a separate terminal with:

% chef-zero -l debug -p 7799

The knife config file will serve two purposes. First, it will be used to load all the artifacts into Chef Zero. Second, it will provide essential configuration to use with chef-client. Let’s look at the required configuration.

This portion tells chef, knife, and chef-client to use the chef-zero instance started earlier.

chef_server_url 'http://localhost:7799'
node_name       'chef-provisioner'

In the next section, I’ll discuss the policyfile feature in more detail. These configuration settings tell chef-client to use policyfiles, and which deployment group the client should use.

use_policyfile   true
deployment_group 'sysadvent-demo-provisioner'

As mentioned above, these are the configuration options that tell Chef Provisioning where the keys are located. The key files must exist on the provisioning node somewhere.

First, here’s the knife config:

private_keys     'hc-metal-provisioner' => '/tmp/ssh/id_rsa'
public_keys      'hc-metal-provisioner' => '/tmp/ssh/'

Then the recipe - this is from the current version of chef-server-cluster::setup-ssh-keys.

fog_key_pair node['chef-server-cluster']['aws']['machine_options']['bootstrap_options']['key_name'] do
  private_key_path '/tmp/ssh/id_rsa'
  public_key_path '/tmp/ssh/'
end

The attribute here is part of the driver options set using the with_machine_options method for Chef Provisioning in chef-server-cluster::setup-provisioner. For further reading about machine options, see the Chef Provisioning configuration documentation. While the machine options will automatically use keys stored in ~/.chef/keys or ~/.ssh, we set the paths explicitly to avoid strange conflicts on local development systems used for test provisioning. An issue has been opened to revisit this.
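To make that long attribute path easier to picture, here is a small plain-Ruby sketch of how the nested attribute might be laid out. The hash below is a hypothetical stand-in for the node object, not the cookbook's actual attributes file; only the key path and the three values (which match the provisioning output shown later in this post) come from the article.

```ruby
# Hypothetical stand-in for node attributes. The real cookbook may define
# more options than the three shown here.
node = {
  'chef-server-cluster' => {
    'aws' => {
      'machine_options' => {
        'bootstrap_options' => {
          'key_name'  => 'hc-metal-provisioner',
          'image_id'  => 'ami-b99ed989',
          'flavor_id' => 'm3.medium'
        }
      }
    }
  }
}

# Walk the same path the fog_key_pair resource name uses.
bootstrap_options = node.dig('chef-server-cluster', 'aws',
                             'machine_options', 'bootstrap_options')
key_name = bootstrap_options['key_name']

puts "fog_key_pair resource name: #{key_name}"
```

The point is that the key pair resource and the machine options read the key name from the same place, so the two cannot drift apart.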


Beware, gentle reader! This is an experimental new feature that may, or rather will, change. However, I wanted to try it out, as it made sense for the workflow when I was assembling this post. Read more about Policyfiles in the ChefDK repository. In particular, read the “Motivation and FAQ” section. Also, Chef (client) 12 is required, which is included in the ChefDK package I have installed on my provisioning system.

The general idea behind Policyfiles is to assemble a node’s run list as an artifact, including all the roles and recipes needed to fulfill its job in the infrastructure. Each Policyfile.rb contains at least the following.

  • name: the name of the policy
  • run_list: the run list for nodes that use this policy
  • default_source: the source where cookbooks should be downloaded (e.g., Supermarket)
  • cookbook: define the cookbooks required to fulfill this policy

As an example, here is the Policyfile.rb I’m using, at the toplevel of the repository:

name            'sysadvent-demo'
run_list        'chef-server-cluster::cluster-provision'
default_source  :community
cookbook        'chef-server-ingredient', '>= 0.0.0',
                :github => 'opscode-cookbooks/chef-server-ingredient'
cookbook        'chef-server-cluster', '>= 0.0.0',
                :github => 'opscode-cookbooks/chef-server-cluster'

Once the Policyfile.rb is written, it needs to be compiled to a lock file (Policyfile.lock.json) with chef install. Installing the policy does the following.

  • Build the policy
  • “Install” the cookbooks to the cookbook store (~/.chefdk/cache/cookbooks)
  • Write the lockfile

This doesn’t put the cookbooks (or the policy) on the Chef Server. We’ll do that in the upload section with chef push.

Data Bags

At CHEF, we prefer to move configurable data and secrets to data bags. For secrets, we generally use Chef Vault, though we’re going to skip that for this example. The chef-server-cluster cookbook has a few data bag items that are required before we can run Chef Client.

Under data_bags, I have these directories/files.

  • secrets/hc-metal-provisioner-chef-aws-us-west-2.json: the name hc-metal-provisioner-chef-aws-us-west-2 is an attribute in the chef-server-cluster::setup-ssh-keys recipe to load the correct item; the private and public SSH keys for the AWS keypair are written out to /tmp/ssh on the provisioner node
  • secrets/private-chef-secrets-_default.json: the complete set of secrets for the Chef Server systems, written to /etc/opscode/private-chef-secrets.json
  • chef_server/topology.json: the topology and configuration of the Chef Server. Currently this doesn’t do much but will be expanded in the future to inform /etc/opscode/chef-server.rb with more configuration options

See the chef-server-cluster cookbook for the latest details about the data bag items required. Note: At this time, chef-vault is not used for secrets, but that will change in the future.

Upload the Repository

Now that we’ve assembled all the required components to converge the provisioner node and start up the Chef Server cluster, let’s get everything loaded on the Chef Server.

Ensure the policyfile is compiled and installed, then push it as the provisioner deployment group. The group name is combined with the policy name in the config that we saw earlier in knife.rb. The chef push command uploads the cookbooks, and also creates a data bag item that stores the policyfile’s rendered JSON.

% chef install
% chef push provisioner
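The naming relationship is easy to miss, so here is a one-function sketch of it in plain Ruby. The helper name is made up for illustration; the only facts taken from this post are the policy name (sysadvent-demo), the group passed to chef push (provisioner), and the deployment_group value in knife.rb.

```ruby
# Illustrative helper, not part of ChefDK: joins the policy name and the
# push group the way the values in this post line up.
def deployment_group_for(policy_name, group)
  "#{policy_name}-#{group}"
end

puts deployment_group_for('sysadvent-demo', 'provisioner')
# prints "sysadvent-demo-provisioner", the deployment_group set in knife.rb
```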

Next, upload the data bags.

% knife upload data_bags

We can now use knife to confirm that everything we need is on the Chef Server:

% knife data bag list
% knife cookbook list
apt                      11131342171167261.63923027125258247.235168191861173
chef-server-cluster      2285060862094129.64629594500995644.198889591798187
chef-server-ingredient   37684361341419357.41541897591682737.246865540583454
chef-vault               11505292086701548.4466613666701158.13536425383812

What’s with those crazy versions? That is what the policyfile feature does. The human-readable versions are no longer used. Instead, cookbook versions are locked using unique, automatically generated version strings, so we know the precise cookbook dependency graph for any given policy. When Chef runs on the provisioner node, it will use the versions in its policy. When Chef runs on the machine instances, since they’re not using Policyfiles, it will use the latest version. In the future we’ll have policies for each of the nodes that are managed with Chef Provisioning.
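For the curious, here is a rough sketch of how a version string shaped like these could be derived from a content hash. This is an assumption on my part rather than a statement of what ChefDK does internally: it takes a 40-character SHA-1 hex digest, splits it into three chunks, and renders each chunk as a decimal integer. The input string is a placeholder for the cookbook's content.

```ruby
require 'digest/sha1'

# Assumption: the identifier is a SHA-1 hex digest of the cookbook content.
# A plain string stands in for that content in this sketch.
def dotted_decimal_version(content)
  hex = Digest::SHA1.hexdigest(content)
  # Three hex chunks (14, 14, and 12 characters), each shown in decimal.
  chunks = [hex[0...14], hex[14...28], hex[28...40]]
  chunks.map { |chunk| chunk.to_i(16) }.join('.')
end

version = dotted_decimal_version('placeholder cookbook content')
puts version
```

Whatever the exact scheme, the useful property is that the same cookbook content always yields the same version string, and any change to the content yields a different one.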


At this point, we have:

  • ChefDK installed on the local provisioning node (laptop) with Chef client version 12
  • AWS IAM user credentials in ~/.aws/config for managing EC2 instances
  • A running Chef Server using chef-zero on the local node
  • The chef-server-cluster cookbook and its dependencies
  • The data bag items required to use chef-server-cluster’s recipes, including the SSH keys Chef Provisioning will use to log into the EC2 instances
  • A knife.rb config file that points chef-client at the chef-zero server and tells it to use policyfiles

Chef Client

Finally, the moment (or several moments…) we have been waiting for! It’s time to run chef-client on the provisioning node.

% chef-client -c .chef/knife.rb

While that runs, let’s talk about what’s going on here.

Normally when chef-client runs, it reads configuration from /etc/chef/client.rb. As I mentioned, I’m using my laptop, which has its own run list and configuration, so I need to specify the knife.rb discussed earlier. This will use the chef-zero Chef Server running on port 7799, and the policyfile deployment group.

In the output, we’ll see Chef get its run list from the policy file, which looks like this:

resolving cookbooks for run list: ["chef-server-cluster::cluster-provision@0.0.7 (081e403)"]
Synchronizing Cookbooks:
  - chef-server-ingredient
  - chef-server-cluster
  - apt
  - chef-vault

The rest of the output should be familiar to Chef users, but let’s talk about some of the things Chef Provisioning is doing. First, the following resource is in the chef-server-cluster::cluster-provision recipe:

machine 'bootstrap-backend' do
  recipe 'chef-server-cluster::bootstrap'
  ohai_hints 'ec2' => '{}'
  action :converge
  converge true
end

The first system that we build in a Chef Server cluster is a backend node that “bootstraps” the data store that will be used by the other nodes. This includes the postgresql database, the RabbitMQ queues, etc. Here’s the output of Chef Provisioning creating this machine resource.

Recipe: chef-server-cluster::cluster-provision
  * machine[bootstrap-backend] action converge
    - creating machine bootstrap-backend on fog:AWS:862552916454:us-west-2
    -   key_name: "hc-metal-provisioner"
    -   image_id: "ami-b99ed989"
    -   flavor_id: "m3.medium"
    - machine bootstrap-backend created as i-14dec01b on fog:AWS:862552916454:us-west-2
    - Update tags for bootstrap-backend on fog:AWS:862552916454:us-west-2
    -   Add Name = "bootstrap-backend"
    -   Add BootstrapId = "http://localhost:7799/nodes/bootstrap-backend"
    -   Add BootstrapHost = "champagne.local"
    -   Add BootstrapUser = "jtimberman"
    - create node bootstrap-backend at http://localhost:7799
    -   add normal.tags = nil
    -   add normal.chef_provisioning = {"location"=>{"driver_url"=>"fog:AWS:XXXXXXXXXXXX:us-west-2", "driver_version"=>"0.11", "server_id"=>"i-14dec01b", "creator"=>"user/IAMUSERNAME", "allocated_at"=>1417385355, "key_name"=>"hc-metal-provisioner", "ssh_username"=>"ubuntu"}}
    -   update run_list from [] to ["recipe[chef-server-cluster::bootstrap]"]
    - waiting for bootstrap-backend (i-14dec01b on fog:AWS:XXXXXXXXXXXX:us-west-2) to be ready ...
    - bootstrap-backend is now ready
    - waiting for bootstrap-backend (i-14dec01b on fog:AWS:XXXXXXXXXXXX:us-west-2) to be connectable (transport up and running) ...
    - bootstrap-backend is now connectable
    - generate private key (2048 bits)
    - create directory /etc/chef on bootstrap-backend
    - write file /etc/chef/client.pem on bootstrap-backend
    - create client bootstrap-backend at clients
    -   add public_key = "-----BEGIN PUBLIC KEY-----\n..."
    - create directory /etc/chef/ohai/hints on bootstrap-backend
    - write file /etc/chef/ohai/hints/ec2.json on bootstrap-backend
    - write file /etc/chef/client.rb on bootstrap-backend
    - write file /tmp/ on bootstrap-backend
    - run 'bash -c ' bash /tmp/'' on bootstrap-backend

From here, Chef Provisioning kicks off a chef-client run on the machine it just created. This script is the one that uses CHEF’s omnitruck service. It will install the current released version of Chef, which is 11.16.4 at the time of writing. Note that this is not version 12, so that’s another reason we can’t use Policyfiles on the machines. The chef-client run is started on the backend instance using the run list specified in the machine resource.

Starting Chef Client, version 11.16.4
 resolving cookbooks for run list: ["chef-server-cluster::bootstrap"]
 Synchronizing Cookbooks:
   - chef-server-cluster
   - chef-server-ingredient
   - chef-vault
   - apt

In the output, we see this recipe and resource:

Recipe: chef-server-cluster::default
  * chef_server_ingredient[chef-server-core] action reconfigure
    * execute[chef-server-core-reconfigure] action run
      - execute chef-server-ctl reconfigure

An “ingredient” is a Chef Server component, either the core package (above), or one of the Chef Server add-ons like Chef Manage or Chef Reporting. In normal installation instructions for each of the add-ons, their appropriate ctl reconfigure is run, which is all handled by the chef_server_ingredient resource. The reconfigure actually runs Chef Solo, so we’re running chef-solo in a chef-client run started inside a chef-client run.

The bootstrap-backend node generates some files that we need on other nodes. To make those available using Chef Provisioning, we use machine_file resources.

%w{ actions-source.json webui_priv.pem }.each do |analytics_file|
  machine_file "/etc/opscode-analytics/#{analytics_file}" do
    local_path "/tmp/stash/#{analytics_file}"
    machine 'bootstrap-backend'
    action :download
  end
end

machine_file '/etc/opscode/webui_pub.pem' do
  local_path '/tmp/stash/webui_pub.pem'
  machine 'bootstrap-backend'
  action :download
end

These are “stashed” on the local node - the provisioner. They’re used for Chef Manage webui, and the Chef Analytics node. When the recipe runs on the provisioner, we see this output:

  * machine_file[/etc/opscode-analytics/actions-source.json] action download
    - download file /etc/opscode-analytics/actions-source.json on bootstrap-backend to /tmp/stash/actions-source.json
  * machine_file[/etc/opscode-analytics/webui_priv.pem] action download
    - download file /etc/opscode-analytics/webui_priv.pem on bootstrap-backend to /tmp/stash/webui_priv.pem
  * machine_file[/etc/opscode/webui_pub.pem] action download
    - download file /etc/opscode/webui_pub.pem on bootstrap-backend to /tmp/stash/webui_pub.pem

They are uploaded to the frontend and analytics machines with the files resource attribute. Files are specified as a hash. The key is the target file to upload to the machine, and the value is the source file from the provisioning node.

machine 'frontend' do
  recipe 'chef-server-cluster::frontend'
  files(
        '/etc/opscode/webui_priv.pem' => '/tmp/stash/webui_priv.pem',
        '/etc/opscode/webui_pub.pem' => '/tmp/stash/webui_pub.pem'
       )
end

machine 'analytics' do
  recipe 'chef-server-cluster::analytics'
  files(
        '/etc/opscode-analytics/actions-source.json' => '/tmp/stash/actions-source.json',
        '/etc/opscode-analytics/webui_priv.pem' => '/tmp/stash/webui_priv.pem'
       )
end
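To see the whole round trip at a glance, here is a plain-Ruby sketch (no Chef involved) that models the stash: which files come down from bootstrap-backend, and where each one goes afterwards. The variable names are mine; the paths are the ones used in the resources in this post.

```ruby
stash_dir = '/tmp/stash'

# Downloads from bootstrap-backend: remote path => local stash path.
downloads = %w[actions-source.json webui_priv.pem].map { |name|
  ["/etc/opscode-analytics/#{name}", File.join(stash_dir, name)]
}.to_h
downloads['/etc/opscode/webui_pub.pem'] = File.join(stash_dir, 'webui_pub.pem')

# Uploads per machine: target path on the machine => local source path.
uploads = {
  'frontend' => {
    '/etc/opscode/webui_priv.pem' => "#{stash_dir}/webui_priv.pem",
    '/etc/opscode/webui_pub.pem'  => "#{stash_dir}/webui_pub.pem"
  },
  'analytics' => {
    '/etc/opscode-analytics/actions-source.json' => "#{stash_dir}/actions-source.json",
    '/etc/opscode-analytics/webui_priv.pem'      => "#{stash_dir}/webui_priv.pem"
  }
}

stashed = downloads.values
uploads.each do |machine, files|
  files.each do |target, source|
    # Every upload source must have been downloaded into the stash first.
    raise "#{source} was never stashed" unless stashed.include?(source)
    puts "upload file #{source} to #{target} on #{machine}"
  end
end
```

Note that webui_priv.pem is downloaded once from /etc/opscode-analytics on the backend but lands in two different places: /etc/opscode on the frontend and /etc/opscode-analytics on the analytics node.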

Note: These files are transferred using SSH, so they’re not passed around in the clear.

The provisioner will converge the frontend next, followed by the analytics node. We’ll skip the bulk of the output since we saw it earlier with the backend.

  * machine[frontend] action converge
  ... SNIP
    - upload file /tmp/stash/webui_priv.pem to /etc/opscode/webui_priv.pem on frontend
    - upload file /tmp/stash/webui_pub.pem to /etc/opscode/webui_pub.pem on frontend

Here is where the files are uploaded to the frontend, so the webui will work (it’s an API client itself, like knife, or chef-client).

When the frontend runs chef-client, not only does it install the chef-server-core package and run chef-server-ctl reconfigure via the ingredient resource, it also gets the manage and reporting add-ons:

* chef_server_ingredient[opscode-manage] action install
  * package[opscode-manage] action install
    - install version 1.6.2-1 of package opscode-manage
* chef_server_ingredient[opscode-reporting] action install
   * package[opscode-reporting] action install
     - install version 1.2.1-1 of package opscode-reporting
Recipe: chef-server-cluster::frontend
  * chef_server_ingredient[opscode-manage] action reconfigure
    * execute[opscode-manage-reconfigure] action run
      - execute opscode-manage-ctl reconfigure
  * chef_server_ingredient[opscode-reporting] action reconfigure
    * execute[opscode-reporting-reconfigure] action run
      - execute opscode-reporting-ctl reconfigure

Similar to the frontend above, the analytics node will be created as an EC2 instance, and we’ll see the files uploaded:

    - upload file /tmp/stash/actions-source.json to /etc/opscode-analytics/actions-source.json on analytics
    - upload file /tmp/stash/webui_priv.pem to /etc/opscode-analytics/webui_priv.pem on analytics

Then, the analytics package is installed as an ingredient, and reconfigured:

* chef_server_ingredient[opscode-analytics] action install
* package[opscode-analytics] action install
  - install version 1.0.4-1 of package opscode-analytics
* chef_server_ingredient[opscode-analytics] action reconfigure
  * execute[opscode-analytics-reconfigure] action run
    - execute opscode-analytics-ctl reconfigure
Chef Client finished, 10/15 resources updated in 1108.3078 seconds

This will be the last thing in the chef-client run on the provisioner, so let’s take a look at what we have.

Results and Verification

We now have three nodes running as EC2 instances for the backend, frontend, and analytics systems in the Chef Server. We can view the node objects on our chef-zero server:

% knife node list

We can use search:

% knife search node 'ec2:*' -r
3 items found

  run_list: recipe[chef-server-cluster::analytics]

  run_list: recipe[chef-server-cluster::bootstrap]

  run_list: recipe[chef-server-cluster::frontend]

% knife search node 'ec2:*' -a ipaddress
3 items found




If we navigate to the frontend IP, we can sign up using the Chef Server management console, then download a starter kit and use that to bootstrap new nodes against the freshly built Chef Server.

% unzip
  inflating: chef-repo/.chef/sysadvent-demo.pem
  inflating: chef-repo/.chef/sysadvent-demo-validator.pem
% cd chef-repo
% knife client list
% knife node create sysadvent-node1 -d
Created node[sysadvent-node1]

If we navigate to the analytics IP, we can sign in with the user we just created, and view the events from downloading the starter kit: the validator client key was regenerated, and the node was created.

Next Steps

For those following at home, this is now a fully functional Chef Server. It does have premium features (manage, reporting, analytics), but those are free up to 25 nodes. We can also destroy the cluster, using the cleanup recipe. That can be applied by disabling policyfile in .chef/knife.rb:

% grep policyfile .chef/knife.rb
# use_policyfile   true
% chef-client -c .chef/knife.rb -o chef-server-cluster::cluster-clean
Recipe: chef-server-cluster::cluster-clean
  * machine[analytics] action destroy
    - destroy machine analytics (i-5cdac453 at fog:AWS:XXXXXXXXXXXX:us-west-2)
    - delete node analytics at http://localhost:7799
    - delete client analytics at clients
  * machine[frontend] action destroy
    - destroy machine frontend (i-68dfc167 at fog:AWS:XXXXXXXXXXXX:us-west-2)
    - delete node frontend at http://localhost:7799
    - delete client frontend at clients
  * machine[bootstrap-backend] action destroy
    - destroy machine bootstrap-backend (i-14dec01b at fog:AWS:XXXXXXXXXXXXX:us-west-2)
    - delete node bootstrap-backend at http://localhost:7799
    - delete client bootstrap-backend at clients
  * directory[/tmp/ssh] action delete
    - delete existing directory /tmp/ssh
  * directory[/tmp/stash] action delete
    - delete existing directory /tmp/stash

As you can see, the Chef Provisioning capability is powerful, and gives us a lot of flexibility for running a Chef Server 12 cluster. Over time, as we rebuild Hosted Chef with it, we’ll add more capability to the cookbook, including HA, scaled-out frontends, and splitting up frontend services onto separate nodes.