Unless you've been living in a remote cave for the last year, you've probably noticed that the world is changing. With the maturing of automation technologies like Puppet, the popular uptake of Cloud Computing, and the rise of Software as a Service, the walls between developers and sysadmins are beginning to be broken down. Increasingly we're beginning to hear phrases like 'Infrastructure is code', and terms like 'Devops'. This is all exciting. It also has an interesting knock-on effect. Most development environments these days are at least strongly influenced by, if not run entirely according to, 'Agile' principles. Scrum in particular has experienced tremendous success, and many non-development teams have adopted it too. On the whole the headline objectives of the Agile movement are to be embraced, but the thorny question of how to apply them to operations work has yet to be answered satisfactorily.
I've been managing systems teams in an Agile environment for a number of years, and after thought and experimentation, I can recommend using an approach borrowed from Lean systems management, called Kanban.
Operations teams need to deliver business value
As a technical manager, my top priority is to ensure that my teams deliver business value. This is especially important for Web 2.0 companies - the infrastructure is the platform -- is the product -- is the revenue. Especially in tough economic times it's vital to make sure that as sysadmins we are adding value to the business.
In practice, this means improving throughput - we need to be fixing problems more quickly, delivering improvements in security, performance and reliability, and removing obstacles to enable us to ship product more quickly. It also means building trust with the business - improving the predictability and reliability of delivery times. And, of course, it means improving quality - the quality of the service we provide, the quality of the staff we train, and the quality of life that we all enjoy - remember - happy people make money.
The development side of the business has understood this for a long time. Aided by Agile principles (and implemented using such approaches as Extreme Programming or Scrum) developers organise their work into iterations, at the end of which they will deliver a minimum marketable feature, which will add value to the business.
The approach may be summarised as moving from the historic model of software development as a large team taking a long time to build a large system, towards small teams, spending a small amount of time, building the smallest thing that will add value to the business, but integrating frequently to see the big picture.
Systems teams starting to work alongside such development teams are often tempted to try the same approach.
The trouble is, for a systems team, committing to a two week plan, and setting aside time for planning and retrospective meetings, prioritisation and estimation sessions just doesn't fit. Sysadmin work is frequently interrupt-driven, demands on time are uneven, frequently specialised and require concentrated focus. Radical shifts in prioritisation are normal. It's not even possible to commit to much shorter sprints of a day, as sysadmin work also includes project and investigation activities that couldn't be delivered in such a short space of time.
Dan Ackerson recently carried out a survey in which he asked sysadmins their opinions and experience of using agile approaches in systems work. The general feeling was that it helped encourage organisation, focus and coordination, but that it didn't seem to handle the reactive nature of systems work, and the prescription of regular meetings interrupted the flow of work. My own experience of sysadmins trying to work in iterations is that they frequently fail their iterations, because the world changed (sometimes several times) and the iteration no longer captured the most important things. A strict, iteration-based approach just doesn't work well for operations - we're solving different problems. When we contrast a highly interdependent systems team with a development team who work together for a focussed time, answering to themselves, it's clear that the same tools won't necessarily be appropriate.
What is Kanban, and how might it help?
Let's keep this really really simple. You might read other explanations making it much more complicated than necessary. A Kanban system is simply a system with two specific characteristics. Firstly, it is a pull-based system. Work is only ever pulled into the system, on the basis of some kind of signal. It is never pushed; it is accepted, when the time is right, and when there is capacity to do the work. Secondly, work in progress (WIP) is limited. At any given time there is a limit to the amount of work flowing through the system - once that limit is reached, no more work is pulled into the system. Once some of that work is complete, space becomes available and more work is pulled into the system.
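To make those two characteristics concrete, here is a minimal sketch in Python. It is an illustration only, not a tool I'm recommending: the class name `KanbanSystem`, its method names and the example cards are all invented for this example.

```python
from collections import deque


class KanbanSystem:
    """A toy pull system: work waits in a queue and is pulled only when there is capacity."""

    def __init__(self, wip_limit):
        self.wip_limit = wip_limit      # maximum number of cards in progress at once
        self.waiting = deque()          # work requested but not yet accepted
        self.in_progress = []           # work the team has pulled
        self.done = []

    def request(self, card):
        """Anyone may ask for work; asking does not push it into progress."""
        self.waiting.append(card)

    def pull(self):
        """Pull the next card if, and only if, the WIP limit leaves room."""
        if len(self.in_progress) >= self.wip_limit or not self.waiting:
            return None
        card = self.waiting.popleft()
        self.in_progress.append(card)
        return card

    def complete(self, card):
        """Finishing a card frees a slot, which is the signal to pull again."""
        self.in_progress.remove(card)
        self.done.append(card)


# Hypothetical cards, purely for illustration.
system = KanbanSystem(wip_limit=2)
for card in ["patch web servers", "rotate backups", "tune database"]:
    system.request(card)

system.pull()                       # accepted: one slot used
system.pull()                       # accepted: limit reached
blocked = system.pull()             # None: no capacity, so nothing is pulled
system.complete("patch web servers")
resumed = system.pull()             # "tune database": a slot opened up
```

Note that `request` never touches the in-progress list. The only way in is `pull`, and `pull` refuses once the limit is reached.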
Kanban as a system is all about managing flow - getting a constant and predictable stream of work through, whilst improving efficiency and quality. This maps perfectly onto systems work - rather than viewing our work as a series of projects, with annoying interruptions, we view our work as a constant stream of work of varying kinds.
As sysadmins we are not generally delivering product, in the sense that a development team are. We're supporting those who do, addressing technical debt in the systems, and looking for opportunities to improve resilience, reliability and performance.
Kanban is usually associated with some tools to make it easy to implement the basic philosophy. Again, keeping it simple, all we need is a stack of index cards and a board.
The word Kanban itself means 'Signal Card' - and is a token which represents a piece of work which needs to be done. This maps conveniently onto the agile 'story card'. The board is a planning tool, and an information radiator. Typically it is organised into the various stages on the journey that a piece of work goes through. This could be as simple as to-do, in-progress, and done, or could feature more intermediate steps.
The WIP limit controls the amount of work (or cards) that can be on any particular part of the board. The board makes visible exactly who is working on what, and how much capacity the team has. It provides information to the team, and to managers and other people, about the progress and priorities of the team.
Kanban teams abandon the concept of iterations altogether. As Andrew Shafer once said to me: "We will just work on the highest priority 'stuff', and kick-ass!"
How does Kanban help?
Kanban brings value to the business in three ways - it improves trust, it improves quality and it improves efficiency.
Trust is improved because very rapidly the team starts being able to deliver quickly on the highest priority work. There's no iteration overhead, it is absolutely transparent what the team is working on, and, because the responsibility for prioritising the work to be done lies outside the technical team, the business soon begins to feel that the team really is working for them.
Quality is improved because the WIP limit makes problems visible very quickly. Let's consider two examples - suppose we have a team of four sysadmins:
The team decides to set a WIP limit of one. This means that the team as a whole will only ever work on one piece of work at a time. While that work is being done, everything else has to wait. The effect of this will be that all four sysadmins will need to work on the same issue simultaneously. This will result in very high quality work, and the tasks themselves should get done fairly quickly, but it will also be wasteful. Work will start queueing up ahead of the 'in progress' section of the board, and the flow of work will be too slow. Also it won't always be possible for all four people to work on the same thing, so for some of the time the other sysadmins will be doing nothing. This will be very obvious to anyone looking at the board. Fairly soon it will become apparent that the WIP limit of one is too low.
Suppose we now decide to increase the WIP limit to ten. The sysadmins go their own ways, each starting work on one card. The progress on each card will be slower, because there's only one person working on it, and the quality may not be as good, as individuals are more likely to make mistakes than pairs. The individual sysadmins also don't concentrate as well on their own, but work is still flowing through the system. However, fairly soon something will come up which makes progress difficult. At this stage a sysadmin will pick another card and work on that. Eventually two or three cards will be 'stuck' on the board, with no progress, while work flows around them owing to the large WIP limit. Eventually we might hit a big problem, system wide, that halts progress on all work, and perhaps even impacts other teams. It turns out that this problem was the reason why work stopped on the tasks earlier on. The problem gets fixed, but the impact on the team's productivity is significant, and the business has been impacted too. Had the WIP limit been lower, the team would have been forced to react sooner.
The board also makes it very clear to the team, and to anyone following the team, what kind of work patterns are building up. As an example, if the team's working cadence seems to be characterised by a large number of interrupts, especially for repeatable work, or to put out fires, that's a sign that the team is paying interest on technical debt. The team can then make a strong case for tackling that debt, and the WIP limit protects the team as they do so.
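If you keep your finished cards, spotting that pattern can be as simple as a tally. The sketch below assumes each completed card was labelled with a kind when it was written. The labels ('interrupt', 'project', 'support') and the sample cards are made up for illustration.

```python
from collections import Counter

# Hypothetical record of cards completed over a fortnight: (title, kind).
completed = [
    ("restart stuck mail queue", "interrupt"),
    ("restore deleted home directory", "interrupt"),
    ("migrate monitoring to new host", "project"),
    ("restart stuck mail queue", "interrupt"),
    ("add disk to build server", "support"),
    ("restart stuck mail queue", "interrupt"),
]

# How much of the team's finished work was interrupt-driven?
by_kind = Counter(kind for _, kind in completed)
interrupt_share = by_kind["interrupt"] / len(completed)

# Interrupts that recur are candidates for a card that tackles the root cause.
repeats = Counter(title for title, kind in completed if kind == "interrupt")
recurring = [title for title, count in repeats.items() if count > 1]
```

A high `interrupt_share` together with a non-empty `recurring` list is the kind of evidence that helps when arguing for time to pay down the debt.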
Efficiency is improved simply because this method of working has been shown to be the best way to get a lot of work through a system. Kanban has its origins in Toyota's lean processes, and has been explored and used in dozens of different kinds of work environment. Again, the effects of the WIP limit, and the visibility of their impact on the board, make it very easy to optimise the system, to reduce the cycle time - that is, to reduce the time it takes to complete a piece of work once it enters the system.
Another benefit of Kanban boards is that they encourage self-management. At any time any team member can look at the board and see at once what is being worked on, what should be worked on next and, with a little experience, can see where the problems are. If there's one thing sysadmins hate, it's being micro-managed. As long as there is commitment to respect the board, a sysops team will self-organise very well around it. Happy teams produce better quality work, at a faster pace.
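Cycle time is easy to measure if you note two timestamps on each card: when it entered the system and when it was finished. Here is a rough sketch. The field names `entered` and `done` and the sample dates are assumptions made for this example.

```python
from datetime import datetime
from statistics import mean

# Hypothetical cards: when each was pulled into the system and when it was finished.
cards = [
    {"title": "patch web servers", "entered": datetime(2024, 3, 4, 9, 0), "done": datetime(2024, 3, 5, 9, 0)},
    {"title": "rotate backups", "entered": datetime(2024, 3, 4, 9, 0), "done": datetime(2024, 3, 7, 9, 0)},
    {"title": "tune database", "entered": datetime(2024, 3, 5, 9, 0), "done": datetime(2024, 3, 7, 9, 0)},
]


def cycle_time_days(card):
    """Elapsed days between a card entering the system and being completed."""
    return (card["done"] - card["entered"]).total_seconds() / 86400


cycle_times = [cycle_time_days(card) for card in cards]
average_cycle_time = mean(cycle_times)
```

Tracking the average from one retrospective to the next shows whether a change to the WIP limit actually helped.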
How do I get started?
If you think this sounds interesting, here are some suggestions for getting started.
Have a chat to the business - your manager and any internal stakeholders. Explain to them that you want to introduce some work practices that will improve quality and efficiency, but which will mean that you will be limiting the amount of work you do - i.e. you will have to start saying no. Try the puppy dog close: "Let's try this for a month - if you don't feel it's working out, we'll go back to the way we work now".
Get the team together, buy them pizza and beer, and try playing some Kanban games. There are a number of ways of doing this, but basically you need to come up with a scenario in which the team has to produce things, but the work is going to be limited and only accepted when there is capacity. Speak to me if you want some more detailed ideas - there are a few decent resources out there.
Get the team together for a white-board session. Try to get a sense of the kinds of phases your work goes through. How much emergency support work is there? How much general user support? How much project work? Draw up a first cut of a Kanban board, and imagine some scenarios. The key thing is to be creative. You can make work flow left to right, or top to bottom. You can use coloured cards or plain cards - it doesn't matter. The point of the board is to show what work is being done, by whom, and to make explicit what the WIP limits are.
Set up your Kanban board somewhere highly visible and easy to get to. You could use a whiteboard and magnets, a cork board and pins, or just stick cards to a wall with Blu Tack. You can draw lines with a ruler, or you can use insulating tape to give bold, straight dividers between sections. Make it big, and clear.
Agree your WIP limit amongst yourselves - it doesn't matter what it is - just pick a sensible number, and be prepared to tweak it based on experience.
Gather your current work backlog together and put each piece of work on a card. If you can, sit with the various stakeholders for whom the work is being done, so you can get a good idea of what the acceptance criteria are, and their relative importance. You'll end up with a huge stack of cards - I keep them in a card box, next to the board.
Get your manager and any stakeholders together, and have a prioritisation session. Explain that there's a work in progress limit, but that work will get done quickly. Your team will work on whatever is agreed is the highest priority. Then stick the highest priority cards to the left of (or above) the board. I like to have a 'Next Please' section on the board, with a WIP limit. Cards can be added to or removed from this section by anyone, and the team will pull from this section when capacity becomes available.
Write up a team charter - decide on the rules. You might agree not to work on other people's cards without asking first. You might agree times of the day you'll work. I suggest two very important rules - once a card goes onto the in progress section of the board, it never comes off again, until it's done. And nobody works on anything that isn't on the board. Write the charter up, and get the team to sign it.
Have a daily standup meeting at the start of the day. At this meeting, unlike a traditional Scrum or XP standup, we don't need to ask who is working on what, or what they're going to work on next - that's already on the board. Instead, talk about how much more is needed to complete the work, and discuss any problems or impediments that have come up. This is a good time for the team to write up cards for work they feel needs to be done to make their systems more reliable, or to make their lives easier. I recommend trying to get agreement from the business to always ensure one such card is in the 'Next Please' section.
Set up a ticketing system. I've used RT and Eventum. The idea is to reduce the number of interrupts, and to make it easy to track whatever work is being carried out. We have a rule of thumb that everything needs a ticket. Work that can be carried out within about ten minutes can just be done, at the discretion of the sysadmin. Anything that's going to take longer needs to go on the board. We have a dedicated 'Support' section on our board, with a WIP limit. If there are more support requests than slots on the board, it's up to the requestors to agree amongst themselves which has the greatest business value (or cost).
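The rule of thumb above boils down to a small decision. Here it is sketched in Python. The function name `triage`, the returned strings and the support limit of three are illustrative assumptions, not features of RT or Eventum.

```python
QUICK_FIX_MINUTES = 10      # the "about ten minutes" rule of thumb from the text
SUPPORT_WIP_LIMIT = 3       # assumed value; pick your own and tune it


def triage(estimated_minutes, support_cards_on_board):
    """Decide what happens to a ticketed support request."""
    if estimated_minutes <= QUICK_FIX_MINUTES:
        # Small enough to just do, at the sysadmin's discretion.
        return "just do it"
    if support_cards_on_board < SUPPORT_WIP_LIMIT:
        # There is a free slot in the 'Support' section.
        return "add to Support section"
    # No free slot: requestors agree among themselves which request matters most.
    return "requestors prioritise"
```

For example, `triage(5, 3)` gives "just do it" even when the Support section is full, while `triage(45, 3)` hands the decision back to the requestors.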
Have a regular retrospective. I find fortnightly is enough. Set aside an hour or so, buy the team lunch, and talk about how the previous fortnight has been. Try to identify areas for improvement. I recommend using 'SWOT' (strengths, weaknesses, opportunities, threats) as a template for discussion. Also try to get into the habit of asking 'Five Whys' - keep asking why until you really get to the root cause. Also try to ensure you fix things 'Three ways'. These habits are part of a practice called 'Kaizen' - continuous improvement. They feed into your Kanban process, and make everyone's life easier, and improve the quality of the systems you're supporting.
The use of Kanban in development and operations teams is an exciting new development, but one which people are finding fits very well with a devops kind of approach to systems and development work. If you want to find out more, I recommend the following resources:
- the home of Kanban for software development; a central place where ideas, resources and experiences are shared.
- mailing list for people deploying Kanban in a software environment - full of very bright and experienced people
- the nascent devops movement
- agile web operations - excellent blog covering all aspects of agile operations from a devops perspective
- agile sysadmin - This author's own blog - focussed around the practical application of technology and agile processes to deliver business value