December 11, 2017

Day 11 - Scaling your on-duty team

By: Damien Pacaud (@damienpacaud)

Edited By: (@bmarsteau)

Our tech team at Teads is mostly based in France, where labour law and legislation provide quite a strict set of rules and boundaries for working outside office hours.

For this reason we’ve had to adapt and give some thought to our on-duty team organization as we grew from a start-up to a scale-up.


Scaling your on-duty team is crucial for most fast-growing startups that operate at a global level. The internet never sleeps, and even with the best design for resilience, one day your system will go down. At Teads, we deliver outstream video advertising for the biggest content publishers in the world. Any downtime has important repercussions on our revenue but also on our publishers’ revenue. We decided to think carefully about scaling our on-duty team in order to minimize downtime when a system fails. That story is below.

Our problem

In a few years, we’ve scaled from a growing startup operating with a few pizza teams into a company where more than 100 developers in 3 different locations deliver new features on a daily basis. We’ve been able to do so by implementing our own version of the “Spotify model”, and it has given us the ability to stay agile while growing the tech team. Applying the same recipe to the on-duty team was a challenge, to say the least. Initially, the on-duty team was composed of a few developers who had been with Teads since the very beginning and who were very knowledgeable about every part of the platform. We relied on their knowledge, their availability, and the fact that they had helped build most of the system. As we grew, the system became larger and more complex. The handful of developers keeping the revenue safe overnight could no longer keep up with the knowledge needed to solve every problem.

First step: Growing the on-duty team

We started looking for people to add to the on-duty team, ideally with someone from each of our feature teams taking part in the rotation. This was our way of implementing “you build it, you run it” in a country with strict labour laws. It meant growing that team to 12 people, and that’s when we hit the first wall. We tried growing the team while having a few visible production incidents (S3 Service Disruption in us-east-1, anyone?) and, of course, no one was volunteering to be on duty.


Besides, being on duty only once every twelve weeks seems counterproductive, as shifts are spread over too long a timespan. By the time you are back on duty, a lot of systems have changed and it is difficult to remember good practices.

Lost battle: trying to be ready

One of the main reasons nobody was applying to the on-duty rotation was the lack of documentation on how to react when an incident arises. We tried to tackle this problem: for a few months we set up meetings, put knowledgeable people in a room and kindly asked them to document the steps to take when incidents happen.


This was too large of a mission, even for a highly motivated team. Soon, meetings were skipped, and documentation was not improving.

At this point, we started thinking about the problem in a different way.

Enter on-duty pairing

The first decision we made was to have two people on duty at the same time for a week-long shift. We try to choose pairs wisely, with non-overlapping skill sets and experience. For example, we will pair a back-end developer with a data-oriented developer. This allows us to cover most systems on the critical chain.

The benefits that we see with on-duty pairing are:

  • It’s much easier to bounce ideas off someone when a problem is impacting production and you (or your pair) do not know how to fix it.
  • Sometimes while on duty, the incident runs so deep that a critical business decision must be made. It’s much easier to share the responsibility of such a decision in the middle of the night. We accept that this may slow down the decision process, as there will be back-and-forth between the pair.
  • In the rare event of someone not waking up to the PagerDuty calls, there is a backup. Interestingly enough, we had never experienced someone not waking up until we started pairing. This raised the question of whether pairing lowers each individual’s sense of alertness because there is a backup, but in the end we feel it has more benefits than downsides.


We implemented this change in a few weeks and so far we are quite happy with it. The team has scaled to 12 developers, coming from all feature teams, and the rotation goes smoothly.

Escalation?

The traditional way of dealing with increasing complexity is to have an escalation policy. We chose not to implement this and have PagerDuty automatically wake up both pairing developers at the same time. This automates the decision of waking-up another human being and makes PagerDuty responsible for it. We don’t want to be responsible for this hard decision so we let the robot do it.


Escalation usually also solves the “I need an expert on [insert any well known distributed system here] and I need her right now” problem. Putting experts on escalation policies is great if you have a big enough pool of experts on each of the systems that you use. For us, this meant that a few people would be on call every other week. We thought this was not acceptable and decided that we could solve this by:

  • Telling the on-duty team members we know they will do their best to recover from the issue
  • Giving them the confidence that, as engineers, they will find a solution
  • Automating as many routine maintenance operations as we can (taking a bad Cassandra node out of the ring, decommissioning and replacing a Kafka broker…)

Post-incident & Playbook

Soon after the incident, we gather everyone from the on-duty team in a room for a blameless, fact-oriented post-mortem. We aim to leave the room after one hour having filled in our very simple post-incident template:

  • Summary of the issue
  • How to respond to such an issue (should it arise again)
  • Action plan

This process allows us to document our interventions and ensure, should the same incident happen, we have a solution to mitigate its effect in a timely manner.


After a few months, we are quite happy with this new on-duty rotation. It has proven useful many times and we now have more documentation than ever on how to react to our alerts. The post-incident ritual also acts as a team bonding meeting and we are thinking of creating more rituals specifically for the on-duty team (on top of each individual’s feature team rituals).

The biggest complexity that we encountered since launching was organizing the Christmas rotation period with pairs. It’s always a challenge to find one person available during those holidays, so trying to find two is double the fun.

December 10, 2017

Day 10 - From Product Eng to Systems Eng

By: Will Gallego (@wcgallego)

Edited By: Dave Mangot (@davemangot)

Engineering is an open field, not a paved road

In engineering, straight lines are few and far between. There is no certain path, no strict guide, no singular right way for every task. There are lots of wrong ways, for sure (being dismissive of others’ hard work, excluding underrepresented or underprivileged folks, stealing ideas and claiming them as your own for a few examples). More often than not it’s hard to know you’re “doing things right” until you look back. It took me a while to understand how my career led me to joining a Systems Engineering team in particular, because so much of this doubt clouded my career path early on. If you’re thinking of making a similar transition, I’d love to see you take a chance in exploring a new facet in the tech world.

Everyone has their own flavor of falling into engineering. Sometimes it’s sitting next to just the right person to peek over their shoulder as they work. Maybe you have a mom who was really into hardware hacking, passing that same love for electronics over to you as you grew up. A lot of the time it could be “part of the job” - you needed to pay the bills or they trained you to write scripts so you could automate some of the more mundane tasks along the way. All are valid and don’t let anyone tell you otherwise.

I grew up with a hobby-centric mindset as my approach into software development. HTML in AOL Pages and writing a blackjack game in BASIC were my gateway drugs, back in the 90’s when it was just starting to become widely accessible to geeks in non-geek families. I had enough knowledge to be dangerous but not enough confidence to push past the fear of failure, even through college. I had a similar reluctance through the first few years of my professional career, a hesitation around parts of the stack that “weren’t for me”. Unless someone asked, I didn’t cross the line.

Taking a bigger step

Often as engineers, we tend to wait for someone to tell us what we can and can’t do. We hesitate to apply for senior roles because we haven’t been called “senior” in a title yet. We hold back on ideas in meetings or questions during architecture reviews worrying that they’re obvious or even stupid (spoiler: they’re not). We don’t question previous design decisions because we assume when it was first built it must have been right and must continue to be right. All of these are intertwined with the fear of being less, doing less, or appearing less in the eyes of those assumed to be smarter or simply more talented than us.

Nothing could be further from the truth.

I don’t believe in an intrinsic trait that breeds systems engineers, or engineers at all for that matter. People are not born and bred for this. Sure, there are folks who are naturally drawn to it and find their mark early in admin work or building distributed systems. That said, no one has database administration in their DNA. There isn’t a gene marker for admins. Noble blood lines delineating who can and can’t be an SRE don’t exist. The field is open to anyone interested, and if you hear someone say differently, they’re only displaying to the world their own insecurities.

Hand in hand with this falsehood is the belief that systems engineering is “harder” than other parts of the stack. Many engineers, myself included, started out building products on the front end typically because the feedback loop is shorter and perhaps arguably more tangible. We convince ourselves that backend work is beyond us because it departs from our comfort zone. Letting go of this self doubt in what you’re capable of opens up a ton of opportunities.

Why join Systems Engineering?

Many of the reasons for which you might have found yourself joining Product Engineering have parallels in Systems Engineering, avenues you might not realize exist. We’re problem solvers and builders, investigators and collaborators. We want to create and to improve, expanding our knowledge of our craft both for ourselves and for others directly and indirectly. If you feel this itch but worry “well, I’m just going to be setting up machines and never coding, right?”, let me allay those fears now. There are a ton of reasons to try your hand here.

Because you care

First and foremost, you care. You’re in this industry because you want to build products that will be impactful, tools that improve someone’s daily life or entertain audiences, maybe even save some lives in the process. Systems Engineers deeply believe in that too, just with a small shift in the focus of said audience. People have problems that need solving, ones that you want to put your energy into helping them overcome.

Empathy should be at the core of everything we do in engineering, regardless of role or position in the company. Typically when building a frontend app, your audience consists of folks external to your org. You might meet a few people who say “Oh, you work at X? I love that and use it every day!” which is a great ego boost knowing you’re making people’s lives better. How great is it to get that same feedback from your friends and coworkers? Combining your knowledge of the needs of your frontend with the functional knowledge of what can be done via the backend can make you an important asset to any technical organization. You get to do so at a very personal level, one in which you can directly ask your consumers “how can I help?”.

You have a passion for understanding

You’re not content in making assumptions about your stack and you’re voracious in your consumption of material to learn. One of my favorite interview questions is simple in its ask and incredibly deep in its answers: “What happens when you type in your browser and hit enter?”. If you’ve ever walked through that, you’ll realize just how far the branching pathways extend in various directions.

  • Well, something has to interpret a domain name into an IP address, but I’ve always handwaved that (something something DNS). But how does DNS actually work?
  • Is it just one host machine somewhere though? Most likely not. How would a site that takes tens, hundreds, thousands of requests a second scale?
  • Hrm, what if I’m logged in - there has to be datastore requests for information relevant to my account to fetch. How do we reliably read and write to that, and how does it change for read or write heavy applications?
  • How are all of those concurrent requests being spread out? Is there a caching layer in front of them to reduce some of that load?
  • What happens when one machine, multiple machines, all of them, fail in a localized zone?
  • How do I know which static asset versions I should be serving?
  • What’s our deployment strategy for making updates to the site?
  • Is there any kind of security for accessing this site - a cert to confirm I’m visiting the correct site and not being hit by a man in the middle attack? Why should I use TLS over SSL?

And this is all just scratching the barest surface for these and many more questions. If you’re passionate about the learning aspect of engineering, you can see how extending yourself further down into the stack gives you near limitless opportunities to grow as an engineer. When you see companies asking for full stack engineers, it’s folks who are asking these questions and more when presented with a new or unknown architecture.

Sometimes it’s fun just to be the smart engineer who knows a ton of stuff, too. This goes a bit beyond the scope of this post, but there’s a distinct push and pull between doing great work you’re proud of while maintaining the humility that you can’t know everything. You certainly don’t want to be the know-it-all jerk who spouts pedantic trivia (think “well, actually…”), but it’s exciting to be the conduit between teams who can answer a ton of questions. Helping people out can create a strong positive feedback loop for feeling valuable in what you do.

In short, systems engineering can be an opportunity to bust through silos and open up some black boxes. See what assumptions you hold that you can break. There’s nothing magical inside, and yet your colleagues will think you’re a wizard with how much you know!

Pushing back against Defensive Attribution Hypothesis

Defensive Attribution Hypothesis is a cognitive bias that involves the disassociation in the thinking, understanding, and skills of others with those of your own in the face of failure. That blameful feeling you get when something goes awry and you want to point fingers at “those engineers over there”? That comes part and parcel with this. We tend to see failure external to ourselves and mentally push it away further, creating a gulf between perceived success and failure. If failure is way over there, then I must be successful over here. Of course in doing so, you’re also pushing understanding of the situation - and people - away.

The classic example of this is the devs vs. ops mentality. A deploy goes out to production. Shortly after, the site experiences an outage. You can imagine exactly what happens next. The developers cry foul, saying “the code was working in our dev environment, so production must not be set up correctly!”. The ops team says “of course it’s set up correctly, the site was working fine up until that deploy so it must be your change that broke everything. Why didn’t you test it more?”. There’s no insight here, no learning from what happened.

Now, imagine you have experience in both product engineering and systems engineering. You understand what time pressures devs are under to accomplish goals and the purpose of the app. They don’t want to break production, but they’re trying to hit their targets for the latest sprint. Likewise, you know what load the backend can handle and what’s required to scale it further. You know metrics like the requests per second your systems are currently under and how the architecture would need to be scaled up should those numbers change. You can see both sides of the situation.

As an example, let’s say your product team is launching a feature that adds a new query to the database. They build it in dev and it’s reading/writing as expected, but under much reduced load. They followed protocol, putting in the change request to the DBA’s and setting up an index for this query. They’re being proactive! Likewise, your DBA’s gain confidence in the deploy because of this request. The devs are testing it and they wouldn’t have asked for the index addition if they weren’t being careful, so they give the thumbs up. The deploy hits prod and things go south, with both sides believing they had done their due diligence and the other is to blame. A lack of perspective promotes conflict. Now let’s add you to this equation, with your prowess in multiple parts of the stack. You see a code review for this pop into your inbox and think:

  • Perhaps this query could be cached for more availability, since it’s fairly read heavy and can hold up to a bit of staleness without adversely affecting what the app is trying to accomplish
  • Maybe you can reduce the number of columns fetched and use a covering query, because you know what is and isn’t needed for the app
  • To give some confidence to future deploys in general, you could set up a canary cluster for the devs to rely on so that they could see the performance of their code changes before it affects end users

Your knowledge of multiple domains lets you assume best of intentions for all parties involved because you can understand where both parties are coming from and empathize with the requirements they face.

What’s next?

So you’re ready to try your hand at systems engineering be it general ops, database admin, networking, etc., but it’s super intimidating to jump in. You have experience with dev work in some form but the divide looks too far to take in one stride. You need to start fresh in a number of areas, which means in a lot of ways you might feel like you’re starting over. Fortunately, there are lots of ways you can ease into the waters without too much disruption. How do you get from here to there?

Temp rotations with your local ops/backend team

Ask your manager if you can do a short term rotation with your company’s ops team. That tool that everyone wants built but no one has time to get to? This is a great opportunity for both you and your teammates to make it happen. With some coordination, you can be everyone’s favorite engineer putting it together while gaining some real world experience. Likewise, if there are tickets in your Jira, Trello, or other respective task board, see if you can snag one or two that look like low hanging fruit. Your admin friends will be grateful for your help in chopping away at the queue and hopefully can extend some of their domain knowledge to level you up in the process. Everyone wins!

Attend new-to-you meetups, conferences, and meetings

Rotations can be a tall order for some companies, though, and not every org’s roadmap allows for a multi-week digression. If your company will sponsor you to go to conferences you might not normally attend, that’s an excellent resource for a longer term investment. Setting up smaller sessions within your org to spread domain knowledge (demos and “lunch and learns” are two great vehicles for this!) can be immensely useful as well. This can simultaneously help to break engineers free of being single points of failure for maintaining subsystems and inform a large swath of folks. Local meetups in your area? Jump on those too! If you’re feeling a bit timid, though, grab a friend to go with, as partnering up can also help the feedback loop for answering questions and stimulating ideas. Sharing information and learning as a team can make those intimidating questions become trivial quickly.

Daily tips and tricks

Impostor syndrome can really hit home when you’re making a large shift like this. Trying to shift gears when you’ve focused so heavily on one vertical in tech is really daunting. DNS? Filesystems? Database integrity? There are so many paths to choose and each goes so deep. With so many people who have deep concentrations of knowledge it can be intimidating to try to “catch up”.

There’s no need to try to boil the ocean learning all of this in one day. You’re in no rush and you’ve got lots of time in your career no matter where you are looking to pick up skills. You’d probably be surprised at how much you’ve learned in a short time - unless you’re writing out your achievements as they come (and you should! It’s another great way to fight impostor syndrome). There’s a quote attributed to Bill Gates: “Most people overestimate what they can do in one year and underestimate what they can do in ten years.”. We try to do so much in the immediate, but set up a long term plan and you can move mountains. Try working on smaller bite sized projects, courses, and reading that’ll move you along your path.

Here are a few ways to get inspired if you’re looking to attack this problem on multiple fronts:

  • Subscribe to twitter feeds that offer daily tips or regularly contribute to learning. Some of my favorites: Julia Evans (@bork), Command Line Magic (@climagic), and of course SysAdvent in December (@SysAdvent)
  • Some of your favorite sites have great tech blogs - Kickstarter, Yelp, Netflix, and Etsy
  • Subscribe to mailing lists and newsletters with articles that arrive in your mailbox - DevOpsWeekly, SysAdmin Casts, Monitoring Weekly, and SRE Weekly
  • Add some tech podcasts or screencasts to your commute - Arrested DevOps, CodeNewbie Podcast, SysAdmincasts, and Devops Cafe
  • If your company is doing demo days, sit in on some for other teams you don’t typically collaborate with.
  • Likewise, hack weeks hosted by your company can bubble up great ideas and inspire you to venture into new parts of the stack with guidance from their owners.
  • Start a tech book club at your company reading over a chapter every week or two. Learning can be even more effective when you’re sharing ideas with other like minded folks.

Looking just over the horizon

Finally, you can look to parts of the stack adjacent to your comfort zone as a direction into systems engineering. If your strengths lie in mobile app building, ask yourself what the API architecture it interacts with might look like. If you need to set up a datastore call, investigate how you might profile and optimize that query to utilize indexes or set up caching around it. If you’re writing views and controllers in your favorite language (say, PHP), take a look behind the curtain to see how a dependency management tool, like Composer, might be installing packages in various environments.

Picking up new skills doesn’t have to feel like being air dropped into the middle of nowhere. There’s something novel for sure about learning about tools that may be wholly different from where your current strengths lie, but easing into it with tech bordering what you’re comfortable with can smooth that transition. Checking out “over the horizon” tech to see what’s near to what you know but still new can help broaden your skillsets while leaving you with a starting point to build from in your mental model.

Bundling this up

For those of you thinking about a transition to systems engineering work, I can attest from personal experience how rewarding it can be. By opening yourself up, you can be a powerful force for good in your company, one with high adaptability and a wide breadth of scope for promoting positive change. Understanding this can afford you exciting work to be a part of and new challenges to spur your career moving forward. It’s a big step into uncharted territory, but one that can be deeply satisfying.

If you’re restricting yourself to familiar comfort zones or if you have a tendency towards vertical over horizontal learning, make this an opportunity to surprise yourself. There’s no rush - you’re not falling behind by exploring other parts of the stack. The best engineers in the industry have made it because they understand that failure is a necessary risk to achieve personal growth. Yes, you’re going to trip and fall. You did the same getting into engineering in the first place! Trust that you’ll eventually land on your feet and be all the better for it in the end.

December 9, 2017

Day 9 - Using Kubernetes for multi-provider, multi-region batch jobs

By: Eric Sigler (@esigler)
Edited By: Michelle Carroll (@miiiiiche)


At some point you may find yourself wanting to run work on multiple infrastructure providers — for reliability against certain kinds of failures, to take advantage of lower costs in capacity between providers during certain times, or for any other reason specific to your infrastructure. This used to be a very frustrating problem, as you’d be restricted to a “lowest common denominator” set of tools, or have to build up your own infrastructure primitives across multiple providers. With Kubernetes, we have a new, more sophisticated set of tools to apply to this problem.

Today we’re going to walk through how to set up multiple Kubernetes clusters on different infrastructure providers (specifically Google Cloud Platform and Amazon Web Services), and then connect them together using federation. Then we’ll go over how you can submit a batch job task to this infrastructure, and have it run wherever there’s available capacity. Finally, we’ll wrap up with how to clean up from this tutorial.


Unfortunately, there isn’t a one-step “make me a bunch of federated Kubernetes clusters” button. Instead, we’ve got several parts we’ll need to take care of:

  1. Have all of the prerequisites in place.
  2. Create a work cluster in AWS.
  3. Create a work cluster in GCE.
  4. Create a host cluster for the federation control plane in AWS.
  5. Join the work clusters to the federation control plane.
  6. Configure all clusters to correctly process batch jobs.
  7. Submit an example batch job to test everything.

A few caveats before we begin:


  1. Kubecon is the first week of December, and Kubernetes 1.9.0 is likely to be released the second week of December, which means this tutorial may go stale quickly. I’ll try to call out what is likely to change, but if you’re reading this and it’s any time after December 2017, caveat emptor.
  2. This is not the only way to set up Kubernetes (and federation). One of the two work clusters could be used for the federation control plane, and having a Kubernetes cluster with only one node is bad for reliability. A final example is that kops is a fantastic tool for managing Kubernetes cluster state, but production infrastructure state management often has additional complexity.
  3. All of the various CLI tools involved (gcloud, aws, kube*, and kops) have really useful environment variables and configuration files that can decrease the verbosity needed to execute commands. I’m going to avoid many of those in favor of being more explicit in this tutorial, and initialize the rest at the beginning of the setup.
  4. This tutorial is based off information from the Kubernetes federation documentation and kops Getting Started documentation for AWS and GCE wherever possible. When in doubt, there’s always the source code on GitHub.
  5. The free tiers of each platform won’t cover all the costs of going through this tutorial, and there are instructions at the end for how to clean up so that you shouldn’t incur unplanned expense — but always double check your accounts to be sure!

Setting up federated Kubernetes clusters on AWS and GCE

Part 1: Take care of the prerequisites

  1. Sign up for accounts on AWS and GCE.
  2. Install the AWS Command Line Interface - brew install awscli.
  3. Install the Google Cloud SDK.
  4. Install the Kubernetes command line tools - brew install kubernetes-cli kubectl kops
  5. Install the kubefed binary from the appropriate tarball for your system.
  6. Make sure you have an SSH key, or generate a new one.
  7. Use credentials that have sufficient access to create resources in both AWS and GCE. You can use something like IAM accounts.
  8. Have appropriate domain names registered, and a DNS zone configured, for each provider you’re using (Route53 for AWS, Cloud DNS for GCP). I will use “” below — note that you’ll need to keep track of the appropriate records.
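
Before going further, it can help to confirm that the command line tools from the list above are actually on your PATH. The following is a minimal sketch rather than part of the original setup: missing_tools is a hypothetical helper name, and it relies only on the POSIX command -v builtin.

```shell
# missing_tools TOOL...
# Prints one line for each named tool that is not found on PATH.
# Returns non-zero if at least one tool is missing.
# (Hypothetical helper, introduced here for illustration only.)
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf 'missing: %s\n' "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# Example call for the tools used in this tutorial:
#   missing_tools aws gcloud gsutil kubectl kops kubefed
```

If the call prints nothing and returns zero, every listed tool was found; otherwise, install whatever it reports before continuing.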

Finally, you’ll need to pick a few unique names in order to run the below steps. Here are the environment variables that you will need to set beforehand:

export S3_BUCKET_NAME="put-your-unique-bucket-name-here"
export GS_BUCKET_NAME="put-your-unique-bucket-name-here"
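
Since every later command depends on these two variables, a quick guard can catch a forgotten export or a placeholder that was never replaced. This is only a sketch under that assumption: require_env is a hypothetical helper name, and the placeholder string it rejects is the one shown in the exports above.

```shell
# require_env NAME...
# Fails if any named variable is unset, empty, or still holds the
# placeholder value from the export lines above.
# (Hypothetical helper, introduced here for illustration only.)
require_env() {
  rc=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name; the names are
    # trusted literals supplied by the caller.
    eval "value=\${$name:-}"
    case "$value" in
      ''|put-your-unique-bucket-name-here)
        printf 'not set to a real value: %s\n' "$name" >&2
        rc=1
        ;;
    esac
  done
  return "$rc"
}

# Example call:
#   require_env S3_BUCKET_NAME GS_BUCKET_NAME || echo "fix the variables first"
```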

Part 2: Set up the work cluster in AWS

To begin, you’ll need to set up the persistent storage that kops will use for the AWS work cluster:

aws s3api create-bucket --bucket "$S3_BUCKET_NAME"

Then, it’s time to create the configuration for the cluster:

kops create cluster \
 --name="" \
 --dns-zone="" \
 --zones="us-east-1a" \
 --master-size="t2.medium" \
 --node-size="t2.medium" \
 --node-count="1" \
 --state="s3://$S3_BUCKET_NAME" \
 --kubernetes-version="1.8.0"

If you want to review the configuration, use kops edit cluster --state="s3://$S3_BUCKET_NAME". When you’re ready to proceed, provision the AWS work cluster by running:

kops update cluster --yes --state="s3://$S3_BUCKET_NAME"

Wait until kubectl get nodes --show-labels shows the NODE role as Ready (it should take 3–5 minutes). Congratulations, you have your first (of three) Kubernetes clusters ready!
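
Rather than re-running the command by hand, the wait can be scripted. The sketch below is an assumption-laden illustration, not the tutorial's own method: all_nodes_ready and wait_for_nodes are hypothetical names, the STATUS value is assumed to be the second column of kubectl get nodes --no-headers, and the 10 second interval is arbitrary.

```shell
# all_nodes_ready
# Reads `kubectl get nodes --no-headers` output on stdin.
# Succeeds only if at least one node is listed and every node's
# STATUS (assumed to be column 2) is Ready or starts with "Ready,".
all_nodes_ready() {
  awk '
    { total++ }
    $2 == "Ready" || $2 ~ /^Ready,/ { ready++ }
    END { exit !(total > 0 && ready == total) }
  '
}

# wait_for_nodes [ATTEMPTS]
# Polls the current kubectl context every 10 seconds (arbitrary choice)
# until all nodes are Ready or the attempts run out.
wait_for_nodes() {
  attempts=${1:-30}
  while [ "$attempts" -gt 0 ]; do
    if kubectl get nodes --no-headers 2>/dev/null | all_nodes_ready; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 10
  done
  return 1
}
```

With the defaults, wait_for_nodes gives up after roughly five minutes, in line with the 3–5 minute estimate above. The same function could be reused after provisioning the other two clusters.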

Part 3: Set up the work cluster in GCE

OK, now we’re going to do a very similar set of steps for our second work cluster, this one on GCE. First though, we need to have a few extra environment variables set:

export PROJECT="$(gcloud config get-value project)"

As the documentation points out, using kops with GCE is still considered alpha. To keep each cluster using vendor-specific tools, let’s set up state storage for the GCE work cluster using Google Storage:

gsutil mb "gs://$GS_BUCKET_NAME/"

Now it’s time to generate the configuration for the GCE work cluster:

kops create cluster \
 --name="" \
 --dns-zone="" \
 --zones="us-east1-b" \
 --state="gs://$GS_BUCKET_NAME/" \
 --project="$PROJECT" \
 --kubernetes-version="1.8.0"

As before, use kops edit cluster --state="gs://$GS_BUCKET_NAME/" to peruse the configuration. When ready, provision the GCE work cluster by running:

kops update cluster --yes --state="gs://$GS_BUCKET_NAME/"

And once kubectl get nodes --show-labels shows the NODE role as Ready, your second work cluster is complete!

Part 4: Set up the host cluster

It’s useful to have a separate cluster that hosts the federation control plane. In production, it’s better to have this isolation to be able to reason about failure modes for different components. In the context of this tutorial, it’s easier to reason about which cluster is doing what work.

In this case, we can use the existing S3 bucket we’ve previously created to hold the configuration for our second AWS cluster — no additional S3 bucket needed! Let’s generate the configuration for the host cluster, which will run the federation control plane:

kops create cluster \
 --name="" \
 --dns-zone="" \
 --zones=us-east-1b \
 --master-size="t2.medium" \
 --node-size="t2.medium" \
 --node-count="1" \
 --state="s3://$S3_BUCKET_NAME" \
 --kubernetes-version="1.8.0"

Once you’re ready, run this command to provision the cluster:

kops update cluster --yes --state="s3://$S3_BUCKET_NAME"

And one last time, wait until kubectl get nodes --show-labels shows the NODE role as Ready.

Part 5: Set up the federation control plane

Now that we have all of the pieces we need to do work across multiple providers, let’s connect them together using federation. First, add aliases for each of the clusters:

kubectl config set-context aws
kubectl config set-context gcp
kubectl config set-context host

Next up, we use the kubefed command to initialize the control plane and add itself as a member:

kubectl config use-context host
kubefed init fed --host-cluster-context=host --dns-provider=aws-route53 --dns-zone-name=""

If the message “Waiting for federation control plane to come up” takes an unreasonably long amount of time to appear, you can check the underlying pods for any issues by running:

kubectl get all --namespace=federation-system
kubectl describe po/fed-controller-manager-EXAMPLE-ID --namespace=federation-system

Once you see “Federation API server is running,” we can join the work clusters to the federation control plane:

kubectl config use-context fed
kubefed join aws --host-cluster-context=host --cluster-context=aws
kubefed join gcp --host-cluster-context=host --cluster-context=gcp
kubectl --context=fed create namespace default

To confirm everything’s working, you should see the aws and gcp clusters when you run:

kubectl --context=fed get clusters

Part 6: Set up the batch job API

(Note: This is likely to change as Kubernetes evolves — this was tested on 1.8.0.) We’ll need to edit the federation API server in the control plane, and enable the batch job API. First, let’s edit the deployment for the fed-apiserver:

kubectl --context=host --namespace=federation-system edit deploy/fed-apiserver

And within the configuration in the federation-apiserver section, add a --runtime-config=batch/v1 line, like so:

  - command:
    - /hyperkube
    - federation-apiserver
    - --admission-control=NamespaceLifecycle
    - --bind-address=
    - --client-ca-file=/etc/federation/apiserver/ca.crt
    - --etcd-servers=http://localhost:2379
    - --secure-port=8443
    - --tls-cert-file=/etc/federation/apiserver/server.crt
    - --tls-private-key-file=/etc/federation/apiserver/server.key
  ... Add the line:
    - --runtime-config=batch/v1

Then restart the Federation API Server and Cluster Manager pods by rebooting the node running them. Watch kubectl get all --context=host --namespace=federation-system if you want to see the various components change state. You can verify the change applied by running the following Python code:

# Sample code from Kubernetes Python client
from kubernetes import client, config


def main():
    # Load credentials from the default kubeconfig (uses the current context).
    config.load_kube_config()

    print("Supported APIs (* is preferred version):")
    print("%-20s %s" %
          ("core", ",".join(client.CoreApi().get_api_versions().versions)))
    for api in client.ApisApi().get_api_versions().groups:
        versions = []
        for v in api.versions:
            name = ""
            # Mark the preferred version when a group offers more than one.
            if v.version == api.preferred_version.version and len(
                    api.versions) > 1:
                name += "*"
            name += v.version
            versions.append(name)
        print("%-40s %s" % (api.name, ",".join(versions)))


if __name__ == '__main__':
    main()

You should see output from that Python script that looks something like:

> python
Supported APIs (* is preferred version):
core                 v1
federation           v1beta1
extensions           v1beta1
batch                v1

Part 7: Submitting an example job

Following along from the Kubernetes batch job documentation, create a file, pi.yaml with the following contents:

apiVersion: batch/v1
kind: Job
metadata:
  generateName: pi-
spec:
  template:
    metadata:
      name: pi
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4

This job spec:

  • Runs a single container to generate the first 2,000 digits of Pi.
  • Uses a generateName, so you can submit it multiple times (each time it will have a different name).
  • Sets restartPolicy: Never, but OnFailure is also allowed for batch jobs.
  • Sets backoffLimit. This generates a parse violation in 1.8, so we have to disable validation.
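
The same spec can also be generated programmatically, which can be handy when you want to submit many variations of a job. The sketch below is an illustration rather than part of the original tutorial: it assumes the standard batch/v1 Job layout and relies on kubectl accepting JSON manifests as well as YAML. The function name make_pi_job and its digits parameter are hypothetical.

```python
import json


def make_pi_job(digits=2000, backoff_limit=4):
    """Build a batch/v1 Job manifest equivalent to pi.yaml as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        # generateName asks the API server to append a unique suffix per submission.
        "metadata": {"generateName": "pi-"},
        "spec": {
            "template": {
                "metadata": {"name": "pi"},
                "spec": {
                    "containers": [{
                        "name": "pi",
                        "image": "perl",
                        "command": ["perl", "-Mbignum=bpi", "-wle",
                                    "print bpi(%d)" % digits],
                    }],
                    "restartPolicy": "Never",
                },
            },
            "backoffLimit": backoff_limit,
        },
    }


if __name__ == "__main__":
    # Print the manifest as JSON so it can be redirected to a file.
    print(json.dumps(make_pi_job(), indent=2))
```

Redirecting the output to a file such as pi.json and passing that file to kubectl create -f in place of pi.yaml should submit an equivalent job.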

Now you can submit the job, and follow it across your federated set of Kubernetes clusters. First, at the federated control plane level, submit and see which work cluster it lands on:

kubectl --validate=false --context=fed create -f ./pi.yaml 
kubectl --context=fed get jobs
kubectl --context=fed describe job/<JOB IDENTIFIER>

Then (assuming it’s the AWS cluster — if not, switch the context below), dive in deeper to see how the job finished:

kubectl --context=aws get jobs
kubectl --context=aws describe job/<JOB IDENTIFIER>
kubectl --context=aws get pods
kubectl --context=aws describe pod/<POD IDENTIFIER>
kubectl --context=aws logs <POD IDENTIFIER>

If all went well, you should see the output from the job. Congratulations!

Cleaning up

Once you’re done trying out this demonstration cluster, you can clean up all of the resources you created by running:

kops delete cluster --yes --state="gs://$GS_BUCKET_NAME/"
kops delete cluster --yes --state="s3://$S3_BUCKET_NAME"
kops delete cluster --yes --state="s3://$S3_BUCKET_NAME"

Don’t forget to verify in the AWS and GCE console that everything was removed, to avoid any unexpected expenses.


Kubernetes provides a tremendous amount of infrastructure flexibility to everyone involved in developing and operating software. There are many different applications for federated Kubernetes clusters.

Good luck to you in whatever your Kubernetes design patterns may be, and happy SysAdvent!

December 8, 2017

Day 8 - Breaking in a New Company as an SRE

By: Amy Tobey (@AlTobey)
Edited By: Tom Purl (@tompurl)

To put it mildly, 2017 has been an interesting year for everyone. My story has its own bumps and advances, including changing jobs twice recently. Changing jobs is a reality many of us face, hopefully voluntarily. It happens infrequently enough that it isn’t something we spend a lot of time talking about. Life’s trials tend to teach us quickly and I hope to share some of what I’ve learned with you.


I started the year as an SRE on the CORE team at Netflix. For various reasons, mostly involving stress, I ended up being fired in late June. I only took a couple days to cool off before kicking off interviews. I knew my financial situation would be stable because of the severance package, but I still wanted to have something lined up as soon as possible. Interviewing can sometimes take many weeks or even months so I got started right away.

My first stop was Apple. I knew lots of engineers from my Cassandra days and a friend there had previously pinged me to see if I was interested. I let them know I was available and that process took off.

I also started talking to two other large tech companies, eventually withdrawing from both. One was too far away and had a lengthy study guide for interviews that I was not willing to commit to. The other was going to make an offer but I found their interview process too heavily tilted towards technical ability with little to no discussion of culture or the day-to-day work.

Finally, I interviewed with two smaller infosec companies. On one call we mutually recognized there was no fit while the other moved on to make me a compelling offer.

I eventually went with the Apple offer as it paid more and I knew more people. The difference I didn’t adequately account for was the commute; the security company is remote and Apple involved a long shuttle commute in Bay Area traffic.

I was only at Apple for 6 weeks and during that time I was trying to experience it with a beginner’s mind. I read docs, code, and listened a lot. In the end, it wasn’t a culture fit for me and the commute was unbearable at 1-2 hours each way, so I quit.

The day after quitting, I set up a call with Tenable and they graciously renewed their offer to me which I accepted immediately. This time I took a couple weeks off because I really wanted to give them my best start.

Day Zero

Since my job is remote, they mailed me my gear. It arrived the Friday before my first day and I took inventory and set it all aside. Some of my accounts had been activated early, but I resisted the urge to start nesting before I was official.

Week 1

I found some bugs with the Windows 10 installation on my laptop on the first day. I spent far too much time dinking with it and finally emailed my managers and IT and got it sorted out within a few hours. I still ended up installing Linux because I found some things about the standard image too constricting.

Besides messing with my laptop, I had a few meetings and spent a lot of time reading docs and code. I spent the remainder getting to know folks over chat and video calls.


  • Start using collective speech immediately, i.e. “ours” instead of “yours”.
  • Reach out for help with IT problems more quickly, even if the issue at hand appears user-serviceable.
  • Reading docs and code in downtime was a great idea.
  • Spending time keeping up with chat rooms is a good way to get to know people in a remote job.
  • Keep notes about things that are troublesome like my laptop issues and bumps in setting up accounts (more on this later).
  • Generate 4KB ssh keys with a good password to impress your colleagues when they copy/paste your pubkey.
  • Install Linux sooner :)

Week 2

With a working laptop and most of my accounts set up, it was time to start digging in. I got set up on the bastions and started looking at some tickets for our databases. Getting into live machines for analysis helped expose gaps in the new-hire onboarding workflow. Where possible I updated docs and other things were added to my notes.


  • Find a ticket to work on right away and give yourself plenty of time. I hit a few snags with access and local conventions that slowed me down.
  • Try to figure things out by searching docs and poking.
  • Start committing small changes to the team’s primary repos. e.g. formatting, spelling, tests, and minor bug fixes.
  • Set a time limit on self-discovery and ask for help.
  • I don’t always take notes, but when I do, it’s usually during my first month on the job.

Week 3

For my third week, the company flew me to headquarters for the official orientation. The first day of company overview, culture introduction, and getting to know the other new hires was enjoyable and interesting. The next couple days were largely sales-focused. My manager and one of my peers flew in and sprung me from those days. We went over our ongoing projects and prioritized work while we got a feel for each other. On the last day of the week I started researching projects we prioritized and started coordinating with other SREs on how to approach things.


  • Get to know one or two people in your hiring “class”. It’s fun and you’ll probably run into these folks again and again regardless of which department each of you are in.
  • Ask about orientation and find out if you really need to attend all of it. At the SaaS companies I’ve experienced there is a lot of content for sales folks (often much higher turnover than other departments) that might not be very relevant to an SRE.
  • Remember to install your password manager on your travel device (unless international).

Week 4

In my reviews of the Cassandra setup I found a few things that needed addressing. Rather than handing off the work that needed to be done, I decided to take it on a little before I felt ready. This gave me the opportunity to start committing to our Ansible repos. I wrote a bit of shell code and updated a bunch of playbooks that led me through learning the basics I needed to work with Ansible daily. I got one project done that was blocking progress for dev teams which led to making friends in those teams as I worked with them.

By this time I was starting to get questions from managers and peers about how I was liking the company and people. I was happy to reply in the positive and kept mental notes about specific things people asked.


  • Find a project in your wheelhouse to start with. For example, I knew Cassandra or Linux tuning would be a good place for me. Find some familiar ground and use it to launch yourself forward.
  • Clearing blocker tickets is an excellent way to make friends in dev teams.
  • Remember to give yourself extra time to adapt to the new environment.

Beginner’s Mind

A few years ago when I was stressing out about a talk, Aaron Morton recommended that I read Zen Mind, Beginner’s Mind. It’s an excellent book if you’re interested in Zen of course, but one thing I took away from it is the value of seeing things with a beginner’s mind. Once we’ve acclimated to a new environment - be it a company, home, city, or vehicle - we are less likely to notice things that are obvious to beginners. I took my notes and data spanning git logs, email, Jira, and even my shell history to create a document with all the things I noticed that were good and those that needed improvement. I chose to keep things as constructive and positive as I could.


  • Pointed out some technical debt, noting that it was not bad and totally fixable.
  • Wrote a quick review of the state of our Ansible repos and procedures.
  • Noted some good patterns I noticed in our AWS setup, particularly that we were leveraging IAM and VPC quite well for being relatively new to cloud ops.
  • Included a detailed review of the Cassandra clusters since that’s an area of expertise for me.
  • Pointed out gaps in operational tooling that we should focus on.
  • Gushed about how I saw the company’s stated culture show up in my interactions with people as I onboarded.
  • Wrote a bit about my experience with our product and threw out my BHAG for it.
  • Recommended improvements for gear & provisioning specific to technical staff.
  • Called out the need to simplify our edge architecture (under way!).
  • Talked briefly about my orientation experience and recommended some changes for engineering hires.
  • Linked some research about IT procedures that needed to be updated.
  • Celebrated that folks were using my name and pronouns consistently.
  • Offered to advise IT folks on some improvements to laptop deploys.

I wrote the doc with my manager, their manager, and my peers in mind. The last I heard it was being passed around the C-levels with positive remarks. Even though I resolved some blockers and deployed a significant upgrade, this document was by far my largest contribution to date. The people I joined were used to their environment and by way of keeping some notes and adding some narrative, I was able to reach a large chunk of the management team and have an impact that benefitted the whole organization.


Starting a new job is stressful regardless of your role. For those of us working in operations and site reliability, the tech work is usually the easy part. Getting acclimated into the social fabric and pacing yourself for the long haul is something I think we all struggle with. By being methodical and mindful about your approach to your first few weeks, it’s possible to make an impact without the usual stress of being new and vulnerable.

As with any tips of this sort, your experience may vary. Some things that worked for me may also work for you. Above all, find what works for you, make a plan, and communicate, communicate, communicate.

December 7, 2017

Day 7 - Running InSpec as a Push Job…or…The Nightmare Before Christmas

By: Annie Hedgepath (@anniehedgie)
Edited By: Jan Ivar Beddari (@beddari)


Bored with his Halloween routine, Jack Skellington longs to spread Christmas joy, but his antics put Santa and the holiday in jeopardy! - Disney


I feel a kindred spirit with Jack Skellington. I, too, wanted to spread some holiday-InSpec joy with my client, but the antics of their air-gapped environment almost put InSpec and my holiday joy in jeopardy. All my client wanted for Christmas was to be able to run my InSpec profile in the Jenkins pipeline to validate configuration of their nodes, and I was eager to give that to them.

Sit back and let me tell the holiday tale of how I had no other choice but to use Chef push jobs to run InSpec in an air-gapped environment and why it almost ruined Christmas.

Nothing would have brought me more holiday cheer than to be able to run the tests as a winrm or ssh command from the Jenkins server directly from a profile in a git repository, not checked out. However, my soul sank as I uncovered reason after reason for the lack of joy for the season:

Scroogey Problems:

  1. Network Connectivity: The nodes are in an air-gapped environment, and we needed InSpec to run every time a node was added.
  2. Jumpbox Not an Option: I could have PowerShell remoted into the jumpbox and run my InSpec command remotely, but this was, again, not an option for me. You see, my InSpec profile required an attributes file. An attribute is a specific detail about a node, so I had to create an attributes file for my InSpec profile by using a template resource in the Chef cookbook. This is because I needed node attributes and data-bag information for my tests that were specific to that particular Chef environment.
  3. SSL Verification: There is an SSL error when trying to access the git repo in order to run the InSpec profile remotely. Chef is working on a feature to disable SSL verification. When that is ready, we can access InSpec via a git link but not now.

Because we were already using push jobs for other tasks, I finally succumbed to the idea that I would need to run my InSpec profiles as ::sigh:: push jobs.

Let me tell you real quickly what a push job is. Basically, you run a cookbook on your node that allows you to push a job from the Chef server onto your node. When you run the push jobs cookbook, you define that job with a simple name like “inspec” and what it does, for example: inspec exec .. Then you run that job with a knife command, like knife job start inspec mynodename.

Easy, right? Don’t get so cheery yet.

This was the high level of what would have to happen, or what you might call the top-of-the-Christmas-tree view:

  1. The InSpec profile is updated.
  2. The InSpec profile is zipped up into a .tar.gz file using inspec archive [path] and placed in the files/default folder of a wrapper cookbook to the cookbook that we were testing. The good thing about using the archive command is that it versions your profile in the file name.
  3. The wrapper cookbook with the new version of the zipped up InSpec profile is uploaded to the Chef server.
  4. Jenkins runs the wrapper cookbook as a push job when a new node is added, and the zipped up InSpec profile is added to the node using Chef’s file resource.
  5. During the cookbook run, an attributes file is created from a template file for the InSpec profile to consume. The push jobs cookbook has a whitelist attribute to which you add your push job. You’re just telling Chef that it’s okay to run this job. Because my InSpec command was different each time due to the version of the InSpec profile, I basically had to make the command into a variable, so that meant I had to nest my attributes, which looks like this:
    node['push_jobs']['whitelist'] = {
     'chef-client' => 'chef-client',
     'inspec' => node['mycookbook']['inspec_command']
    }

    The inspec_command attribute was defined like this (more nesting):

    "C:/opscode/chef/embedded/bin/inspec exec #{Chef::Config[:file_cache_path]}/cookbooks/mycookbook/files/default/mycookbook-inspec-#{default['mycookbook']['inspec_profile_version']}.tar.gz --attrs #{default['mycookbook']['inspec_attributes_path']}"
  6. Another Jenkins stage is added that runs the “inspec” push job.
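
To make the nesting in step 5 easier to follow, here is a small Python sketch of how the whitelisted command string is assembled from the profile version and the attributes path. It is only an illustration and not the cookbook code: the function names, the example cache path, the version number, and the attributes path are all made up, and the archive file name simply follows the mycookbook-inspec-<version>.tar.gz pattern shown above.

```python
def build_inspec_command(file_cache_path, profile_version, attributes_path,
                         cookbook="mycookbook"):
    """Compose the `inspec exec` command line that the push job will run."""
    archive = "%s/cookbooks/%s/files/default/%s-inspec-%s.tar.gz" % (
        file_cache_path, cookbook, cookbook, profile_version)
    return "C:/opscode/chef/embedded/bin/inspec exec %s --attrs %s" % (
        archive, attributes_path)


def build_whitelist(inspec_command):
    """Mirror the push jobs whitelist: job name mapped to the allowed command."""
    return {
        "chef-client": "chef-client",
        "inspec": inspec_command,
    }


if __name__ == "__main__":
    # All three argument values are hypothetical placeholders.
    command = build_inspec_command("C:/chef/cache", "0.1.0",
                                   "C:/chef/inspec_attributes.yml")
    print(build_whitelist(command)["inspec"])
```

Because the version is part of the archive name, bumping the profile version changes the command stored under the "inspec" key, which is why the command has to be an attribute rather than a fixed string.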

And all of that needs to be automated so that it actually stays updated. Yay…

I will not get into the details of automating this process, but here is the basic idea. It is necessary to leverage a build that is kicked off in Jenkins by a pull request made in git. That build, which is a Jenkinsfile in my InSpec profile, does this:

  • archives the profile after it merges into master
  • checks out the wrapper cookbook and creates a branch
  • adds the new version of the profile to the files/default directory
  • updates the InSpec profile version number in the attributes file
  • makes a pull request to the wrapper cookbook’s master branch that also has a pull request build which ensures that Test Kitchen passes before it is merged


So…this works, but it’s not fun at all. It’s definitely the Nightmare Before Christmas and the Grinch Who Stole Christmas wrapped up into one. It takes a few plugins in both Jenkins and BitBucket, which can be difficult to pull off if you don’t have admin rights. I used this blog post as a reference.

I battled internally with a simpler way to do this. A couple of nice alternatives could have been Saltstack and Chef Automate, but neither of those was an option for me. I’m not familiar with Saltstack, but I’m told that its remote execution feature would be able to run InSpec in an air-gapped environment. Likewise, Chef Automate has the Chef Compliance feature which runs all of your InSpec profiles from the Compliance server that you can put in your network. I’m still on the fence about whether those would have been easier to implement, though, because of the heavy dependence I had on the node attributes and data-bags that are stored on the Chef server.

As ugly as this process is, every time I see all those successful test results displayed in the Jenkins output, I can’t help but put a big ol' jolly smile on my face. Sure, it super sucks to jump through all these hoops to get InSpec to work in this environment, but when the automation works, it just works and no one knows what I had to go through to get it there. It’s like a Christmas miracle.


Do I recommend doing it this way if you don’t have to? No. Is this a great workaround if you have no other way to validate your configuration? Absolutely.

And if you need further convincing of the Christmas magic of InSpec, be sure to read my post last year about how InSpec builds empathy across organizations.

I hope you enjoyed my post! Many special thanks to Jan Ivar Beddari for editing this post and to Chris Webber for organizing this very merry blog for all of us! You can follow me on Twitter @anniehedgie. If you’d like to read more about InSpec, I wrote a whole tutorial series for you to follow here. And if you’d like me and my team at 10th Magnitude to help you out with all things Azure, give us a shout!

December 6, 2017

Day 6 - sysadmins - the evolution of a role amidst revolutionary hype.

By: Robert Treat (@robtreat2)

Edited By: Daniel “phrawzty” Maher (@phrawzty)

Like so many things in our industry, our job titles have become victims of the never-ending hype cycle. While the ideas behind terms like “DevOps” or “Site Reliability Engineering” are certainly valid, over time the ideas get lost and the terms become little more than buzzwords, amplified by a recruiting industry more concerned about their own short-term paychecks than our long-term career journeys. As jobs and workers are rebranded as DevOps and SREs, many sysadmins are left wondering if they are being left behind. Add in a heavy dose of the cloud, and a sysadmin has to wonder whether they will have a job in a few years. The good news is that despite all the noise, the need for sysadmins has never been stronger; you just need to see the connections between the technology you grew up on and the technology that is going to move us forward in the next several years.

It used to be that when you started a new project you first had to determine what hardware to use, both from a physical standpoint but also from a pricing standpoint. In the cloud, these concerns are still there but have shifted. While most people in the cloud no longer worry about sizing infrastructure correctly at the start of a project, the tradeoff of being able to re-size VMs with relative ease is that it is also easy to oversize your instances, and in the cloud oversize means overspend, all day every day; at some point you need to work out the math to determine what your growth curve looks like in terms of resource needs vs dollars spent. Ever made that joke about having to bust out the slide rule to run cost comparisons between Riverbed, NetApp, and LSI? As much as they try, the cloud hasn’t made IOPS go away. Helping set estimates on how many IOPS an application will consume still requires a bit of maths, only now you also need to know your way around EBS and SSDs vs Provisioned IOPS in order to determine a reasonable IOPS per dollar ratio. But hey, you’ve done that before.

And that’s the thing; there are many skills which transfer like that. Scared about the new world of microservices? Don’t be - you were dealing with microservice-like architectures long before the rest of us had even heard of the term. At its core, microservices are just a collection of loosely coupled services designed to meet some set of business goals. While we have not traditionally built and run applications that way, the mental leap for a sysadmin familiar with managing machines running Apache, Sendmail, OpenLDAP, and Squid is much less than for a developer who has only ever dealt with building complex monolithic applications. As sysadmins, we don’t think twice about individual services running on different ports, speaking different protocols, and providing completely different methods for observing their behavior; that’s just the way it is. Compare that to a development community that has wasted a generation trying to build ORMs to abstract away the concept of data storage rather than just learning to talk to another service in its own language.

This isn’t to say you can rest on your laurels. The field of Web Operations and the software that powers it is constantly changing, so you need to develop the ability to take what you already know and apply it to new software and systems. It is worth pointing out that this won’t always be easy or clean; new technology often misrepresents itself. For example, the rise of tools like Chef and Docker left many sysadmins wondering which direction to turn, but if you study these tools for a bit, you see that they draw similar patterns to old techniques. It can certainly be difficult for folks who have spent years coding on the command line to grok the syntax of a configuration management tool’s DSL, but you can certainly understand why companies want to automate systems; the idea of replacing manual labor with something automated is something we print on t-shirts just for fun. And sure, I understand how the yarn ball of recipes, resources, and roles might look like overkill to a lot of people, but I’ve also seen some crazy complex mixes of bash and Perl used as startup scripts during a PXE boot, so it’s all relative.

When Docker first came on the scene, it also promised to revolutionize everything we know about managing systems. All the hype around containers seemed to focus on resource management and security, but the reality was mostly just a new way to package and distribute software, whereby new I mostly just mean different. I’ve often lamented that something is lost when an operator never learns how to compile code or bypasses the experience of packaging other people’s software. The promise of Docker is that you can put together systems like a set of Legos, using pre-existing pieces, but stray from the well-trodden path even a little and you’ll find all of those magic compile time errors and strange library dependencies that you are familiar with from systems in the past. Like any system built on abstractions, there are assumptions (and therefore dependencies) baked in three levels deep. If you ever debated whether to rewrite an rpm spec file from scratch after a half day hacking on the distro’s spec file trying to add in the one module you need the maintainers didn’t… replace rpm spec file with dockerfile and you have someone to share root beers with. Sure the container magic is magic when it works, but the devil is in the dependencies.

Of course no conversation about the role of the sysadmin would be complete without touching on the topics of networks and security. While sometimes made the purview of dedicated personnel, at some level these two areas always seem to fall back to the operations team. Understanding the different networks within your organization, the boundaries between those networks, and the who or how to traverse them has always been a part of life as a sysadmin. Unfortunately in what should be one of the most directly applicable skillsets (networks are still networks), the current situation in cloud land has actually gotten worse; the stakes are fundamentally higher in a world where the public internet is always just a mis-configuration away. Incorrect permissions on a network file share might expose sensitive material to everyone in the company, but incorrect permissions in S3 expose those files to the world. Networking is also more complicated in the cloud world. I’ve always been pretty impressed by how far one could get with VPNs and SSH when building your own, but with cloud providers focused on attracting enterprise clients in regulated industry, you’ll have to learn new tooling built to meet those needs, for better or worse. It can still be done, just be aware it is going to work a little differently.

So the good news is that the role of the sysadmin isn’t going away. While the specifics may have changed, resource management, automation, packaging, network management, security, rapid response, and all the other parts of the sysadmin ethos remain critical skills that companies will need going forward. Unfortunately that doesn’t solve the problem of companies thinking they need to hire more “DevOps Developers” (though I never see jobs for “DevOps Operators” - go figure!) and other such crazy things. As I see it, it is easy to make fun of companies who want to hire DevOps engineers because they don’t understand DevOps, but you can also look at it like hiring a network engineer or security engineer - believing you need someone who specializes in automation (and likely CM or CI/CD specifically) is not necessarily a bad thing. So the next time you’re feeling lost or wondering where your next journey may take you, remember that if you focus on the principles of running systems reliably and keep your learning focused on fundamental computing skills, even though the tools may change the problems are fundamentally the same.

December 5, 2017

Day 5 - Do you want to build a helm chart?

By: Paul Czarkowski (@pczarkowski)

Edited By: Paul Stack (@stack72)

Kubernetes Kubernetes Kubernetes is the new Docker Docker Docker

“Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.”

Remember way back in 2015 when all anybody would talk about was Docker even though nobody actually knew what it was or what to do with it? That’s where Kubernetes is right now. Before we get into Helm charts, a quick primer on Kubernetes is a good idea.

Kubernetes provides scheduling and management for your containerized applications as well as the networking and other necessary plumbing and surfaces its resources to the developer in the form of declarative manifests written in YAML or JSON.

A Pod is the smallest unit that can be deployed with Kubernetes. It contains one or more collocated containers that share the same [internal] IP address. Generally a pod is just a single container, but if your application requires a sidecar container to share the same IP or a shared volume then you would declare multiple containers in the same pod.

A Pod is unmanaged and will not be recreated if the process inside the container ends. Kubernetes has a number of resources that build upon a pod to provide different types of lifecycle management such as a Deployment that will ensure the correct number of replicas of your Pod are running.

A Service provides a stable name, IP address, and DNS (if KubeDNS is enabled) across a number of pods and acts as a basic load balancer. This allows pods to easily communicate with each other and can also provide a way for Kubernetes to expose your pods externally.

Helm is a Package Manager for Kubernetes. It doesn’t package the containers (or Pods) directly but instead packages the Kubernetes manifests used to build those Pods. It also provides a templating engine that allows you to deploy your application in a number of different scenarios and configurations.

[Helm] Charts are easy to create, version, share, and publish — so start using Helm and stop the copy-and-paste madness. –

Let’s Build a Helm Chart!

In order to follow along with this tutorial you will need to install Minikube, kubectl, and Helm.

If you are on a Mac you should be able to use the following to install the necessary bits:

$ brew cask install minikube
$ brew install kubernetes-helm

If you already have a Kubernetes manifest, it’s very easy to turn it into a Helm Chart that you can then iterate on and improve as you need to add more flexibility to it. In fact, your first iteration of a Helm chart can be your existing manifests tied together with a simple Chart.yaml file.

Prepare Environment

Bring up a test Kubernetes environment using Minikube:

$ minikube start
Starting local Kubernetes v1.7.5 cluster...
Starting VM...
Getting VM IP address...
Moving files into cluster...
Setting up certs...
Connecting to cluster...
Setting up kubeconfig...
Starting cluster components...
Kubectl is now configured to use the cluster.

Wait a minute or so and then install Helm’s Tiller service to Kubernetes:

$ helm init
$HELM_HOME has been configured at /home/XXXX/.helm.

Tiller (the Helm server-side component) has been installed into your Kubernetes Cluster.
Happy Helming!

If this fails, you may need to wait a few more minutes for minikube to become accessible.
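Rather than waiting and retrying by hand, the retry can be scripted. The following is a minimal sketch of a generic retry helper; the function name `wait_for` and its argument order are made up for this example and are not part of Helm or minikube.

```shell
# wait_for ATTEMPTS DELAY COMMAND...
# Re-run COMMAND until it succeeds, up to ATTEMPTS times, sleeping DELAY
# seconds between tries. Returns 0 on success, 1 if every attempt failed.
wait_for() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Discard the command's output; only its exit status matters here.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep only between attempts, not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_for 30 10 kubectl get nodes` would poll the cluster for up to about five minutes before giving up. With the Helm 2 client used in this article, `wait_for 30 10 helm version` should behave similarly for Tiller, on the assumption that `helm version` fails while Tiller is unreachable.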

Create a path to work in:

$ mkdir -p ~/development/my-first-helm-chart

$ cd ~/development/my-first-helm-chart

Create Example Kubernetes Manifest

Writing a Helm Chart is easier when you’re starting with an existing set of Kubernetes manifests. One of the easiest ways to get a basic working manifest is to ask Kubernetes to run something and output the resultant manifest to a file.

Run a basic nginx Deployment and expose it via a NodePort Service:

$ mkdir -p templates

$ kubectl run example --image=nginx:alpine \
    -o yaml > templates/deployment.yaml
$ kubectl expose deployment example --port=80 --type=NodePort \
    -o yaml > templates/service.yaml

Minikube has some helper functions to let you easily find the URL of your service. Run curl against your service to ensure that it’s running as expected:

$ minikube service example --url

$ curl $(minikube service example --url)
<title>Welcome to nginx!</title>

You’ll see you now have two Kubernetes manifests saved. We can use these to bootstrap our helm charts:

$ tree
└── templates
    ├── deployment.yaml
    └── service.yaml

Explore the deployment.yaml file in a text editor. Following is an abbreviated version of it with comments to help you understand what some of the sections mean:

# These first two lines appear in every Kubernetes manifest and provide
# a way to declare the type of resource and the version of the API to
# interact with.
apiVersion: apps/v1beta1
kind: Deployment
metadata:
# under metadata you set the resource's name and can assign labels to it
# these labels can be used to tie resources together. In the service.yaml
# file you'll see it refers back to this `run: example` label.
  labels:
    run: example
  name: example
spec:
# how many replicas of the declared pod should I run ?
  replicas: 1
  selector:
    matchLabels:
      run: example
# the Pod that the Deployment will manage the lifecycle of.
# You can see once again the use of the label and the containers
# to run as part of the pod.
  template:
    metadata:
      labels:
        run: example
    spec:
      containers:
      - image: nginx:alpine
        imagePullPolicy: IfNotPresent
        name: example

Explore the service.yaml file in a text editor. Following is an abbreviated version of it:

apiVersion: v1
kind: Service
metadata:
  labels:
    run: example
  name: example
spec:
# The clusterIP is the IP address that other nodes can use to access the pods
# Since we didn't specify an IP Kubernetes picked one for us.
# The Port mappings for the service.
  ports:
  - nodePort: 32587
    port: 80
    protocol: TCP
    targetPort: 80
# Any pods that have this label will be exposed by this service.
  selector:
    run: example
# All Kubernetes worker nodes will expose this service to the outside world
# on the port specified above as `nodePort`.
  type: NodePort

Delete the resources you just created so that you can move on to creating the Helm Chart:

$ kubectl delete service,deployment example
service "example" deleted
deployment "example" deleted

Create and Deploy a Basic Helm Chart

The minimum set of things needed for a valid helm chart is a set of templates (which we just created) and a Chart.yaml file which we need to create.

Copy and paste the following into your text editor of choice and save it as Chart.yaml:

Note: the file should be capitalized as shown above in order for Helm to use it correctly.

apiVersion: v1
description: My First Helm Chart
name: my-first-helm-chart
version: 0.1.0

We now have the most basic Helm Chart possible:

$ tree
├── Chart.yaml
└── templates
    ├── deployment.yaml
    └── service.yaml
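Before installing, the layout above can be sanity-checked with a few lines of shell. This is only a sketch: `check_chart_layout` is a hypothetical helper written for this article, and it looks at file names only, not at their contents.

```shell
# check_chart_layout DIR
# Hypothetical helper: verify that DIR has a Chart.yaml (with that exact
# capitalization) and a templates/ directory with at least one .yaml file.
check_chart_layout() {
  chart_dir="$1"
  # Match against the directory listing so the check stays case-sensitive
  # even on case-insensitive filesystems.
  if ! ls "$chart_dir" 2>/dev/null | grep -qxF 'Chart.yaml'; then
    echo "missing Chart.yaml (check capitalization)"
    return 1
  fi
  if [ ! -d "$chart_dir/templates" ]; then
    echo "missing templates/ directory"
    return 1
  fi
  if ! ls "$chart_dir/templates" | grep -q '\.yaml$'; then
    echo "no .yaml files in templates/"
    return 1
  fi
  echo "layout looks OK"
}
```

Running `check_chart_layout .` from the chart directory should print `layout looks OK` for the tree shown above. For a fuller check that also parses the files, Helm ships a `helm lint` command.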

Next you should be able to install this helm chart, giving it a release name of example and using the current directory as the source of the Helm Chart:

$ helm install -n example .
NAME:   example
LAST DEPLOYED: Wed Nov 22 10:55:11 2017
NAMESPACE: default

==> v1beta1/Deployment
example  1        1        1           0          1s

==> v1/Service
example   <nodes>      80:32587/TCP  1s

Just as you did earlier you can use minikube to get the URL:

$ curl $(minikube service example --url)
<title>Welcome to nginx!</title>

Congratulations! You’ve just created and deployed your first Helm chart. However, it’s a little bit basic. The next step is to add some templating to the manifests and update the deployment.

Add variables to your Helm Chart

In order to render templates you need a set of variables. Helm charts can come with a values.yaml file which declares a set of variables and their default values that can be used in your templates. Create a values.yaml file that looks like this:

replicaCount: 2
image: "nginx:alpine"

These values can be accessed in the templates using the golang templating engine. For example the value replicaCount would be written as {{ .Values.replicaCount }}. Helm also provides information about the Chart and Release that can be handy to utilize.
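Once templates reference values this way, it is easy to use a name that was never declared in values.yaml. The sketch below is a rough check for that, under stated assumptions: `check_values` is a hypothetical helper, it only understands top-level keys such as `replicaCount`, and it relies on plain text matching rather than on parsing the YAML or the templates.

```shell
# check_values CHART_DIR
# Hypothetical helper: list the top-level .Values.* names used under
# CHART_DIR/templates and report any that values.yaml does not declare.
# Returns 0 if every referenced name is declared, 1 otherwise.
check_values() {
  chart_dir="$1"
  missing=0
  # Extract names such as "replicaCount" from "{{ .Values.replicaCount }}".
  for key in $(grep -rhoE '\.Values\.[A-Za-z0-9_]+' "$chart_dir/templates" \
                 | sed 's/^\.Values\.//' | sort -u); do
    # A top-level key starts at column 0 and is followed by a colon.
    if ! grep -qE "^${key}:" "$chart_dir/values.yaml"; then
      echo "not declared in values.yaml: $key"
      missing=1
    fi
  done
  return "$missing"
}
```

Run as `check_values .` from the chart directory. With the two-line values.yaml above and templates that use only `.Values.replicaCount` and `.Values.image`, it should report nothing.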

Update your templates/deployment.yaml to utilize our values:

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  labels:
    run: "{{ .Release.Name }}"
    chart: "{{ .Chart.Name }}-{{ .Chart.Version }}"
    release: "{{ .Release.Name }}"
  name: "{{ .Release.Name }}"
  namespace: default
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      run: "{{ .Release.Name }}"
  template:
    metadata:
      labels:
        run: "{{ .Release.Name }}"
    spec:
      containers:
      - image: "{{ .Values.image }}"
        name: "{{ .Release.Name }}"

Edit your templates/service.yaml to look like:

apiVersion: v1
kind: Service
metadata:
  name: "{{ .Release.Name }}"
  labels:
    run: "{{ .Release.Name }}"
    chart: "{{ .Chart.Name }}-{{ .Chart.Version }}"
    release: "{{ .Release.Name }}"
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: "{{ .Release.Name }}"
  type: NodePort

Once your files are written out you should be able to update your deployment:

$ helm upgrade example .
Release "example" has been upgraded. Happy Helming!
LAST DEPLOYED: Wed Nov 22 11:12:25 2017
NAMESPACE: default

==> v1/Service
example   <nodes>      80:31664/TCP  14s

==> v1beta1/Deployment
example  2        2        2           2          14s

You’ll notice that your Deployment now shows as having two replicas of your pod, demonstrating that the replicaCount value provided has been applied:

$ kubectl get deployments
example   2         2         2            2           2m

$ kubectl get pods
NAME                       READY     STATUS    RESTARTS   AGE
example-5c794cbb55-cvn4k   1/1       Running   0          2m
example-5c794cbb55-dc7gf   1/1       Running   0          2m

$ curl $(minikube service example --url)
<title>Welcome to nginx!</title>

You can override values on the command line when you install (or upgrade) a Release of your Helm Chart. Create a new release of your helm chart setting the image to apache instead of nginx:

$ helm install -n apache . --set image=httpd:alpine --set replicaCount=3
NAME:   apache
LAST DEPLOYED: Wed Nov 22 11:20:06 2017
NAMESPACE: default

==> v1beta1/Deployment
apache  3        3        3           0          0s

==> v1/Service
apache  <nodes>      80:30841/TCP  0s
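The same overrides can also live in a file instead of on the command line, which tends to be easier to review once there are more than a couple of values. A minimal sketch follows; the file name `apache-values.yaml` is an assumption chosen for this example.

```shell
# Write a hypothetical override file carrying the same two values that the
# --set flags above pass on the command line.
workdir=$(mktemp -d)
cat > "$workdir/apache-values.yaml" <<'EOF'
replicaCount: 3
image: "httpd:alpine"
EOF
cat "$workdir/apache-values.yaml"
```

With the Helm 2 client used in this article, such a file can be passed through the `-f` (`--values`) flag, for example `helm install -n apache -f apache-values.yaml .`, and values in it take precedence over the defaults in the chart's values.yaml.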

Kubernetes will now show two sets of Deployments and Services and their corresponding pods:

$ kubectl get svc,deployment,pod
NAME             TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)        AGE
svc/apache       NodePort   <none>        80:30841/TCP   1m
svc/example      NodePort    <none>        80:31664/TCP   8m
svc/kubernetes   ClusterIP     <none>        443/TCP        58m

deploy/apache    3         3         3            3           1m
deploy/example   2         2         2            2           8m

NAME                          READY     STATUS    RESTARTS   AGE
po/apache-5dc6dcd8b5-2xmpn    1/1       Running   0          1m
po/apache-5dc6dcd8b5-4kkt7    1/1       Running   0          1m
po/apache-5dc6dcd8b5-d2pvt    1/1       Running   0          1m
po/example-5c794cbb55-cvn4k   1/1       Running   0          8m
po/example-5c794cbb55-dc7gf   1/1       Running   0          8m

Because the manifests were templated earlier to use the Helm release name in the labels of the Kubernetes resources, the Service for each release will only talk to the pods of its corresponding Deployment:

$ curl $(minikube service example --url)
<title>Welcome to nginx!</title>

$ curl $(minikube service apache --url)
<html><body><h1>It works!</h1></body></html>

Clean Up

Delete your helm deployments:

$ helm delete example --purge
release "example" deleted

$ helm delete apache --purge
release "apache" deleted

$ minikube delete
Deleting local Kubernetes cluster...
Machine deleted.


Congratulations, you have deployed a Kubernetes cluster on your laptop using minikube and deployed a basic application to Kubernetes by creating a Deployment and a Service. You have also built your very first Helm chart and used the Helm templating engine to deploy different versions of the application.

Helm is a very powerful way to package up your Kubernetes manifests to make them extensible and portable. While it is quite complicated, it’s fairly easy to get started with, and if you’re like me you’ll find yourself replacing the Kubernetes manifests in your code repos with Helm Charts.

There’s a lot more you can do with Helm; we’ve just scratched the surface. Enjoy using Helm charts and learning more about them!