December 25, 2012

Day 25 - CFEngine Sketches

This was written by Aleksey Tsalolikhin.

I learned something new at LISA this year!  CFEngine is building a high-level approach to configuration management called “design sketches”.  “Sketches” are built on top of the comprehensive CFEngine 3 language.  

CFEngine 3 itself is quite comprehensive.  CFEngine 3.3.9 contains:

  • 25 primitives, called “promise types”, which can be used to model system aspects, such as files, processes, services, commands, etc.;
  • 870 potential attributes of those primitives;
  • 95 functions;
  • 2200 total pages of documentation for CFEngine 3.

Because sketches overlie the DSL, you never have to touch the DSL to configure a system.  In other words, the DSL is abstracted, and the configuration becomes data driven. That’s what we really care about, isn’t it?

The reason it’s called a “sketch” is because you take a design pattern (such as configuring sshd to increase security) and fill it with data (the exact settings you want at your site or for a particular group of machines) and only then do you have something usable. You can’t use the sketch by itself, it is incomplete.  You complete it by configuring it.

sketch + data = something usable

This is still 1.0 and being worked on; but I’m very excited about what I’ve seen so far.

Demonstration

Allow me to demonstrate how easy CFEngine makes it to implement a design pattern using sketches.  

Let’s start by installing CFEngine and the add-on tool, cf-sketch, which allows us to handle sketches.  I invite you to follow along in a VM:

# Run these commands to get cf-sketch installed
lynx http://www.cfengine.com/inside/myspace # download package
rpm -ihv cfengine-community-3.4.1.rpm # install RPM or dpkg
wget --no-check-certificate https://github.com/cfengine/design-center/raw/master/tools/downloads/cf-sketch-latest.tar.gz
tar zxvf cf-sketch-latest.tar.gz # unpack the cf-sketch add-on
cd cf-sketch
make install

cf-sketch is prototyped in Perl.  I’m doing this on CentOS 5.8, and my perl File::Path module wasn’t up to date, so cf-sketch wouldn’t run until I updated File::Path.  I also installed the other Perl modules cf-sketch complained about:

echo "install File::Path" | perl -MCPAN -e 'shell'
perl -MCPAN -e 'install JSON'
perl -MCPAN -e 'install Term::ReadLine::Gnu'
yum -y install perl-libwww-perl
yes | perl -MCPAN -e 'install  LWP::Protocol::https'  

I could then fire up cf-sketch with no complaints:

# cf-sketch
Welcome to cf-sketch version 3.4.0b1.
CFEngine AS, 2012.

Enter any command to cf-sketch, use 'help' for help, or 'quit' or '^D' to quit.

cf-sketch>

The "list" command shows cf-sketch ships with the CFEngine standard library, same version as in the main RPM:

cf-sketch> list

The following sketches are installed:

1. CFEngine::stdlib (library)

Use list -v to show the activation parameters.

cf-sketch>

The "info all" command will show you all available sketches - there are 28 sketches available today.  Anybody can submit a sketch.  However, sketches are closely reviewed and curated by CFEngine staff to ensure high quality.  After all, our civilization will be running on configuration management tools and policies!

Let’s try VCS::vcs_mirror -  its purpose is to keep a git (or Subversion) clone up to date and clean.  

First, let’s install it:

cf-sketch>  install VCS::vcs_mirror

Installing VCS::vcs_mirror
Checking and installing sketch files.
Done installing VCS::vcs_mirror

cf-sketch>

“List” now shows the vcs_mirror sketch is installed but not configured:

cf-sketch> list

The following sketches are installed:

1. CFEngine::stdlib (library)
2. VCS::vcs_mirror (not configured)

Use list -v to show the activation parameters.

cf-sketch>

Let’s configure it.  You have to specify the path where you want the clone to live, the origin, and the branch to keep the working tree checked out on.  

So let’s say we want to clone the CFEngine Design Center.  The Design Center contains sketches, examples and tools, and is at https://github.com/cfengine/design-center.git

Let’s say we want to mirror the master branch to /tmp/design-center.

cf-sketch> configure  VCS::vcs_mirror

Entering interactive configuration for sketch VCS::vcs_mirror.
Please enter the requested parameters (enter STOP to abort):

Parameter 'vcs' must be a PATH.
Please enter vcs: /usr/bin/git

Parameter 'path' must be a PATH.
Please enter path: /tmp/design-center

Parameter 'origin' must be a HTTP_URL|PATH.
Please enter origin: https://github.com/cfengine/design-center.git

Parameter 'branch' must be a NON_EMPTY_STRING.
Please enter branch [master]: master

Parameter 'runas' must be a NON_EMPTY_STRING.
Please enter runas [getenv("USER", "128"): cfengine

Parameter 'umask' must be a OCTAL.
Please enter umask [022]: 022

Parameter 'activated' must be a CONTEXT.
Please enter activated [any]: any

Parameter 'nowipe' must be a CONTEXT.
Please enter nowipe [!any]: !any
Configured: VCS::vcs_mirror #1

cf-sketch>

The sketch is now configured and ready for use.  Note the “#1”: that means this is an instance of the sketch - you can have more than one instance.

  • runas is the user CFEngine will run the command as.
  • activated refers to the context of the promise - where does it apply.  “any” is a special context that is always true.  If we wanted to limit this policy to linux servers, we could have put “linux” there.  Or “Wednesday” if we wanted this policy to only run on Wednesdays.
  • nowipe refers to saving local differences. We set it to “not any” which means, always wipe local differences.

If you have any questions about what the parameters mean, the sketch is documented in /var/cfengine/inputs/sketches/VCS/vcs_mirror/README.md.  All sketches get installed under /var/cfengine/inputs/sketches/ and come with documentation.

The documentation is not yet available from within the cf-sketch shell but this will be added next year.

Let’s check the configuration:

cf-sketch> list -v

The following sketches are installed:

1. CFEngine::stdlib (library)
2. VCS::vcs_mirror (configured)
        Instance #1: (Activated on 'any')
                branch: master
                nowipe: !any
                origin: https://github.com/cfengine/design-center.git
                path: /tmp/design-center
                runas: cfengine
                umask: 022
                vcs: /usr/bin/git

cf-sketch>

Now let’s run our sketches, to make sure they work OK:

cf-sketch> run

Generated standalone run file /var/cfengine/inputs/standalone-cf-sketch-runfile.cf

Now executing the runfile with: /var/cfengine/bin/cf-agent  -f /var/cfengine/inputs/standalone-cf-sketch-runfile.cf

cf-sketch>

Check in /tmp/design-center.  It now contains a mirror of the design-center repo.  Pass!

Now let’s deploy our sketch so it is run automatically by CFEngine (which runs every 5 minutes):

cf-sketch> deploy

Generated non-standalone run file /var/cfengine/inputs/cf-sketch-runfile.cf
This runfile will be automatically executed from promises.cf

cf-sketch> list

Advanced Uses

You can deploy a sketch on your policy hub and have it affect your entire infrastructure or a portion of it that you specify (that’s the “activated” field in the “list -v” output - where the policy should be activated).

You can capture the configuration data of a sketch instance into a JSON file, and move it from one infrastructure to another.

Further Reading

Appendix: 28 sketches available now

  • CFEngine::sketch_template - Standard template for Design Center sketches
  • CFEngine::stdlib - CFEngine standard library (also known as COPBL)
  • Cloud::Services - Manage EC2 and VMware instances
  • Database::Install - Install and enable the MySQL, Postgres or SQLite database engines
  • Library::Hardware::Info - Discover hardware information
  • Monitoring::Snmp::hp_snmp_agents - Install and optionally configure hp-snmp-agents
  • Monitoring::nagios_plugin_agent - Run Nagios plugins and optionally take action
  • Packages::CPAN::cpanm - Install CPAN packages through App::cpanminus
  • Repository::Yum::Client - Manage yum repo client configs in /etc/yum.repos.d
  • Repository::Yum::Maintain - Create and keep Yum repository metadata up to date
  • Repository::apt::Maintain - Manage deb repositories in /etc/apt/sources.list.d/ files or /etc/apt/sources.list
  • Security::SSH - Configure and enable sshd
  • Security::file_integrity - File hashes will be generated at intervals specified by ifelapsed. On modification, you can update the hashes automatically. In either case, a local report will be generated and transferred to the CFEngine hub (CFEngine Enterprise only). Note that scanning the files requires a lot of disk and CPU cycles, so you should be careful when selecting the amount of files to check and the interval at which it happens (ifelapsed).
  • Security::security_limits - Configure /etc/security/limits.conf
  • Security::tcpwrappers - Manage /etc/hosts.{allow,deny}
  • System::config_resolver - Configure DNS resolver
  • System::cron - Manage crontab and /etc/cron.d contents
  • System::etc_hosts - Manage /etc/hosts
  • System::set_hostname - Configure system hostname
  • System::sysctl - Manage sysctl values
  • System::tzconfig - Manage system timezone configuration
  • Utilities::abortclasses - Abort execution if a certain file exists, aka 'Cowboy mode'
  • Utilities::ipverify - Execute a bundle if reachable ip has known MAC address
  • Utilities::ping_report - Report on pingability of hosts
  • VCS::vcs_mirror - Check out and update a VCS repository.
  • WebApps::wordpress_install - Install and configure Wordpress
  • Webserver::Install - Install and configure a webserver, e.g. Apache
  • Yale::stdlib - Yale standard library

December 24, 2012

Day 24 - Twelve things you may not know about Chef

This was written by Joshua Timberman.

In this post, we will discuss a number of features that can be used in managing systems with Chef, but may be overlooked by some users. We'll also look at some features that are not so commonly used, and may prove helpful.

Here's a table of contents:

  1. Resources are first class citizens
  2. In-place file editing
  3. File Checksum comparisons
  4. Version matching
  5. Encrypting Data for Chef's Use
  6. Chef has a REPL
  7. Working with the Resource Collection
  8. Extending the Recipe DSL with helpers
  9. Load and execute a single recipe
  10. Integrating Chef with Your Tools
  11. Sending information to various places
  12. Tagging nodes

(1) Resources are first class citizens

This is probably something most readers who are familiar with Chef already do know. However, we do encounter some uses of Chef that indicate that the author didn't know this. For example, this is from an actual recipe I have seen:

execute "yum install foo" do
  not_if "rpm -qa | grep '^foo'"
end

execute "/etc/init.d/food start" do
  not_if "ps awux | grep /usr/sbin/food"
end

This totally works, assuming that the grep doesn't turn up a false positive (someone reading the 'food' man page?). However, there are resources for this kind of thing, so it's best to use them instead:

package "foo" do
  action :install
end

service "food" do
  action :start
end

Core Chef Resources

Chef comes with a great many resources. These are for managing common components of operating systems, but they are also primitives that can be used on their own or composed into new resources.

Some common resources include package, template, service, file, directory, cron, and user.
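
As a quick illustration (a sketch not taken from the original post; the package, template, and service names are hypothetical), a typical recipe combines a few of these:

package "nginx" do
  action :install
end

template "/etc/nginx/nginx.conf" do
  source "nginx.conf.erb"
  owner "root"
  group "root"
  mode 00644
  notifies :reload, "service[nginx]"
end

service "nginx" do
  supports :reload => true
  action [:enable, :start]
end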

These actually make up probably 80% or more of the resources people will use. However, Chef comes with a few other resources that are less commonly used but still highly useful.

The scm resource has two providers, git and subversion, which can be used as the resource type. These are useful if a source repository must be checked out. For example, myproject is in Subversion, and yourproject is in git.

subversion "myproject" do
  repository "svn://code.example.com/repos/myproject/trunk"
  destination "/opt/share/myproject"
  revision "HEAD"
  action :checkout
end

git "yourproject" do
  repository "git://github.com/you/yourproject.git"
  destination "/usr/local/src/yourproject"
  reference "1.2.3" # some tag
  action :checkout
end

This is used under the covers in the deploy resource.
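
For reference, a minimal use of deploy might look something like the following sketch (the path, repository, and user are made up, and the resource takes many more attributes than shown):

deploy "/srv/yourproject" do
  repo "git://github.com/you/yourproject.git"
  revision "HEAD"
  user "deploy"
  action :deploy
end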

The ohai resource can be used to reload attributes on the node that come from Ohai plugins.

For example, we can create a user, and then tell ohai to reload the plugin that has all user and group information.

ohai "reload_passwd" do
  action :nothing
  plugin "passwd"
end

user "daemonuser" do
  home "/dev/null"
  shell "/sbin/nologin"
  system true
  notifies :reload, "ohai[reload_passwd]", :immediately
end

Or, we can drop off a new plugin as a template, and then load that plugin.

ohai "reload_nginx" do
  action :nothing
  plugin "nginx"
end

template "#{node['ohai']['plugin_path']}/nginx.rb" do
  source "plugins/nginx.rb.erb"
  owner "root"
  group "root"
  mode 00755
  notifies :reload, 'ohai[reload_nginx]', :immediately
end

If your recipe(s) manipulate system state that future resources need to be aware of, this can be quite helpful.

The http_request resource makes... an HTTP request. This can be used to send (or receive) data via an API.

For example, we can send a request to retrieve some information:

http_request "some_message" do
  url "http://api.example.com/check_in"
end

But more usefully, we can send a POST request. For example, on a Chef Server with CouchDB (Chef 10 and earlier), we can compact the database:

http_request "compact chef couchDB" do
  url "http://localhost:5984/chef/_compact"
  action :post
end

If you're building a custom lightweight resource/provider for an API service like a monitoring system, this could be a helpful primitive to build upon.

Opscode Cookbooks

Aside from the resources built into Chef, Opscode publishes a number of cookbooks that contain custom resources, or "LWRPs". See the README for these cookbooks for examples.

There are many more, and documentation for them is on the Opscode Chef docs site.

(2) In-place file editing

For a number of reasons, people may need to manage the content of files by replacing or adding specific lines. The common use case is something like sysctl.conf, which may have different tuning requirements from different applications on a single server.

This is an anti-pattern

Many folks who practice configuration management see this as an anti-pattern, and recommend managing the whole file instead. While that is ideal, it may not make sense for everyone's environment.

But if you really must...

The Chef source has a handy utility library to provide this functionality, Chef::Util::FileEdit. This provides a number of methods that can be used to manipulate file contents. These are used inside a ruby_block resource so that the Ruby code is done during the "execution phase" of the Chef run.

ruby_block "edit etc hosts" do
  block do
    rc = Chef::Util::FileEdit.new("/etc/hosts")
    rc.search_file_replace_line(
      /^127\.0\.0\.1 localhost$/,
      "127.0.0.1 #{new_fqdn} #{new_hostname} localhost"
    )
    rc.write_file
  end
end

For another example, Sean OMeara has written a line cookbook that includes a resource/provider to append a line to a file if it doesn't exist.

(3) File Checksum comparisons

In managing file content with the file, template, cookbook_file, and remote_file resources, Chef compares the content using a SHA256 checksum. This class can be used in your own Ruby programs or libraries too. Sure, you can use the "sha256sum" command, but this is native Ruby instead of shelling out.

The class to use is Chef::ChecksumCache and the method is #checksum_for_file.

require 'chef/checksum_cache'
sha256 = Chef::ChecksumCache.checksum_for_file("/path/to/file")

(4) Version matching

It is quite common to need version string comparison checks in recipes. Perhaps we want to match the version of the platform this node is running on. Often we can simply use a numeric comparison between floating point numbers or strings:

if node['platform_version'].to_f == 10.04
if node['platform_version'] == "6.3"

However, sometimes we have versions that use three points, and matching on the third portion is relevant. This would get lost in #to_f, and greater/less than comparisons may not match with strings.

Chef::VersionConstraint

The Chef::VersionConstraint class (http://rubydoc.info/gems/chef/10.16.2/Chef/VersionConstraint) can be used for version comparisons. It is modeled after the version constraints in Chef cookbooks themselves.

First we initialize the Chef::VersionConstraint with an argument containing the comparison operator and the version as a string. Then, we send the #include? method with the version to compare as an argument. For example, we might be checking that the version of OS X is 10.7 or higher (Lion).

require 'chef/version_constraint'
Chef::VersionConstraint.new(">= 10.7.0").include?("10.6.0") #=> false
Chef::VersionConstraint.new(">= 10.7.0").include?("10.7.3") #=> true
Chef::VersionConstraint.new(">= 10.7.0").include?("10.8.2") #=> true

Or, in a Chef recipe we can use the node's platform version attribute. For example, on a CentOS 5.8 system:

Chef::VersionConstraint.new("~> 6.0").include?(node['platform_version']) # false

But on a CentOS 6.3 system:

Chef::VersionConstraint.new("~> 6.0").include?(node['platform_version']) # true

Chef's version number is stored as a node attribute (node['chef_packages']['chef']['version']) that can be used in recipes. Perhaps we want to check for a particular version because we're going to use a feature in the recipe only available in newer versions.

version_checker = Chef::VersionConstraint.new(">= 0.10.10")
mac_service_supported = version_checker.include?(node['chef_packages']['chef']['version'])

if mac_service_supported
  # the mac service is supported, so do these things
end

(5) Encrypting Data for Chef's Use

By default, the data stored on the Chef Server is not encrypted. Node attributes, while containing useful data, are plaintext for anyone that has a private key authorized to the Chef Server. However, sometimes it is desirable to store encrypted data, and Data Bags (stores of arbitrary JSON data) can be encrypted.

You'll need a secret key. This can be a phrase or a file. The key needs to be available on any system that will need to decrypt the data. A cryptographically strong secret key is best, and can be generated with OpenSSL:

openssl rand -base64 512 > ~/.chef/encrypted_data_bag_secret

Next, create the data bag that will contain encrypted items. For example, I'll use secrets.

knife data bag create secrets

Next, create the items in the bag that will be encrypted.

knife data bag create secrets credentials --secret-file ~/.chef/encrypted_data_bag_secret
{
  "id": "credentials",
  "user": "joshua",
  "password": "dirty_secrets"
}

Then, view the content of the data bag item:

knife data bag show secrets credentials
id:        credentials
password:  cKZgOISOE+lmRiqf9j5LlRegtcILqvVw6XRft11T7Pg=

user:      mBf1UDwAGq0N0Ohqugabfg==

Naturally, this is encrypted using the secret file. Decrypt it:

knife data bag show secrets credentials --secret-file ~/.chef/encrypted_data_bag_secret
id:        credentials
password:  dirty_secrets
user:      joshua

To use this data in a recipe, the secret file must be copied and its location configured in Chef. The knife bootstrap command can do this automatically if your knife.rb contains the encrypted_data_bag_secret configuration. Presuming that the .chef directory contains the knife.rb and the above secret file:

encrypted_data_bag_secret "./encrypted_data_bag_secret"

In a Recipe, Chef::EncryptedDataBagItem

Nodes bootstrapped using the default bootstrap template will have the secret key file copied to /etc/chef/encrypted_data_bag_secret, and available for Chef. This is a constant in the Chef::EncryptedDataBagItem class, DEFAULT_SECRET_FILE. To use this in a recipe, use the #load_secret method, then pass that as an argument to the #load method for the data bag item. Finally, access various keys from the item like a Ruby Hash. Example below:

secret = Chef::EncryptedDataBagItem.load_secret(Chef::EncryptedDataBagItem::DEFAULT_SECRET_FILE)
user_creds = Chef::EncryptedDataBagItem.load("secrets","credentials", secret)
user_creds['id'] # => "credentials"
user_creds['user'] # => "joshua"
user_creds['password'] # => "dirty_secrets"

(6) Chef has a REPL

Chef comes with a built-in "REPL" or shell, called shef. A REPL is "Read, Eval, Print, Loop" or "read what I typed in, evaluate it, print out the results, and do it again." Other examples of REPLs are Python's python w/ no arguments, a Unix shell, or Ruby's irb.

shef (chef-shell in Chef 11)

In Chef 10 and earlier, the Chef REPL is invoked as a binary named shef. In Chef 11 and later, it is renamed to chef-shell. Additional options can be passed on the command line, including a config file to use, or an overall mode to use (solo or client/server). See shef --help for options.

Once invoked, shef has multiple run-time contexts that can be used:

  • main
  • recipe (recipe_mode in Chef 11)
  • attributes (attributes_mode in Chef 11)

At any time, you can type "help" to get context specific help. The "main" context provides a number of API helper methods. The "attributes" context functions as a cookbook's attributes file. The "recipe" context is in the Chef recipe DSL context, where resources can be created and run. For example:

chef:recipe > package "zsh" do
chef:recipe >   action :install
chef:recipe ?> end
 => <package[zsh] @name: "zsh" @package_name: "zsh" @resource_name: :package >

(the output is trimmed for brevity, try it on your own system)

This is similar to how Chef actually works when processing recipes. It has recognized the input as a Chef Resource and added it to the resource collection. This doesn't actually manage the resource until we enter the execution phase, similar to a Chef run. We can do that with the shef method run_chef:

chef:recipe > run_chef
[2012-12-23T12:32:27-07:00] INFO: Processing package[zsh] action install ((irb#1) line 1)
[2012-12-23T12:32:27-07:00] DEBUG: package[zsh] checking package status for zsh
zsh:
  Installed: 4.3.17-1ubuntu1
  Candidate: 4.3.17-1ubuntu1
  Version table:
 *** 4.3.17-1ubuntu1 0
        500 http://us.archive.ubuntu.com/ubuntu/ precise/main amd64 Packages
        100 /var/lib/dpkg/status
[2012-12-23T12:32:27-07:00] DEBUG: package[zsh] current version is 4.3.17-1ubuntu1
[2012-12-23T12:32:27-07:00] DEBUG: package[zsh] candidate version is 4.3.17-1ubuntu1
[2012-12-23T12:32:27-07:00] DEBUG: package[zsh] is already installed - nothing to do
 => true

There are many possibilities for debugging and exploring with this tool. For example, use it to test the examples that are presented in this post.

chef/shef/ext (renamed in Chef 11)

The methods available in the "main" context of Shef are also available to your own scripts and plugins by requiring Chef::Shef::Ext. In Chef 11, this will be Chef::Shell::Ext, though the old one is present for compatibility.

require 'chef/shef/ext'
Shef::Extensions.extend_context_object(self)
nodes.all # => [node[doppelbock], node[cask], node[ipa]]

(7) Working with the Resource Collection

One of the features of Chef is that Recipes are pure Ruby. As such, we can manipulate things that are in the Object Space, such as other Chef objects. One of these is the Resource Collection, the data structure that contains all the resources that have been seen as Chef processes recipes. Using shef, or any Chef recipe, we can work with the resource collection for a variety of reasons.

Look Up Another Resource

The #resources method will return an array of all the resources. From our shef session earlier, we have a single resource:

chef:recipe > resources
["package[zsh]"]

We can add others.

chef:recipe > service "food"
chef:recipe > file "/tmp/food-zsh-completion"

Now when we look at the resource collection, we'll see the new resources:

chef:recipe > resources
["package[zsh]", "service[food]", "file[/tmp/food-zsh-completion]"]

We can use the resources method to open a specific resource.

"Re-Open" Resources to Modify/Override

If we look at the service[food] resource that was created (using all default parameters), we'll see:

chef:recipe > resources("service[food]")
<service[food] @name: "food" @noop: nil @before: nil @params: {} @provider: nil @allowed_actions: [:nothing, :enable, :disable, :start, :stop, :restart, :reload] @action: "nothing" @updated: false @updated_by_last_action: false @supports: {:restart=>false, :reload=>false, :status=>false} @ignore_failure: false @retries: 0 @retry_delay: 2 @source_line: "(irb#1):2:in `irb_binding'" @elapsed_time: 0 @resource_name: :service @service_name: "food" @enabled: nil @running: nil @parameters: nil @pattern: "food" @start_command: nil @stop_command: nil @status_command: nil @restart_command: nil @reload_command: nil @priority: nil @startup_type: :automatic @cookbook_name: nil @recipe_name: nil>

To work with this, it is easier to assign to a local variable.

chef:recipe > f = resources("service[food]")

Then, we can call the various parameters as accessor methods.

chef:recipe > f.supports
 => {:restart=>false, :reload=>false, :status=>false}

We can modify this by sending the supports method to f with additional arguments. For example, maybe the food service supports restart and status commands, but not reload:

chef:recipe > f.supports({:restart => true, :status => true})
 => {:restart=>true, :status=>true}

As a more practical example, perhaps you want to use a cookbook from the Chef Community Site that manages a couple services on Ubuntu. However, the author of the cookbook hasn't updated the cookbook in a while, and those services are managed by upstart instead of being init.d scripts. You could create a custom cookbook that "wraps" the upstream cookbook with a recipe like this to modify those service resources:

if platform?("ubuntu")
  ["service_one, "service_two].each do |s|
    srv = resource("service[#{s}]")
    srv.provider Chef::Provider::Service::Upstart
    srv.start_command "/usr/bin/service #{s} start"
  end
end

Then in the node's run list, you'd have the upstream cookbook's recipe and your custom recipe:

{
  "run_list": [
    "their_upstream",
    "your_custom"
  ]
}

This is a pattern that has become popular with the idea of "Library" vs. "Application" cookbooks, and Bryan Berry has a RubyGem to provide a helper for it.

(8) Extending the Recipe DSL with helpers

One of the features of a Chef cookbook is that it can contain a "libraries" directory with files containing helper libraries. These can be new Chef Resources/Providers, ways of interacting with third party services, or simply extending the Chef Recipe DSL.

Let's just have a simple method that shortcuts the Chef version attribute so we don't have to type the whole thing in our recipes.

First, create a cookbook named "my_helpers".

knife cookbook create my_helpers

Then create the library file. It can be named anything you want; all library files are loaded by Chef.

touch cookbooks/my_helpers/libraries/default.rb

Then, since we are extending the Chef Recipe DSL, add this method to its class, Chef::Recipe.

class Chef
  class Recipe
    def chef_version
      node['chef_packages']['chef']['version']
    end
  end
end

To use this in a recipe, simply call that method. From the earlier example:

mac_service_supported = version_checker.include?(chef_version)

Next, I'll use a helper library for the Encrypted Data Bag example from earlier to demonstrate this. I created a separate library file.

touch cookbooks/my_helpers/libraries/encrypted_data_bag_item.rb

It contains:

class Chef
  class Recipe
    def encrypted_data_bag_item(bag, item, secret_file = Chef::EncryptedDataBagItem::DEFAULT_SECRET_FILE)
      DataBag.validate_name!(bag.to_s)
      DataBagItem.validate_id!(item)
      secret = EncryptedDataBagItem.load_secret(secret_file)
      EncryptedDataBagItem.load(bag, item, secret)
    rescue Exception
      Log.error("Failed to load data bag item: #{bag.inspect} #{item.inspect}")
      raise
    end
  end
end

Now, when I want to use it in a recipe, I can:

user_creds = encrypted_data_bag_item("secrets", "credentials")

(9) Load and execute a single recipe

In default operation, Chef loads cookbooks and recipes from their directories on disk. It is actually possible to load a single recipe file by composing a new binary program from Chef's built-in classes. This is helpful for simple use cases or as a general example. Dan DeLeo of Opscode wrote this as a gist awhile back, which I've updated here:

https://gist.github.com//4366061

It's only 45 lines counting whitespace. Simply save that to a file, and then create a recipe file, and run it with the filename as an argument.

root@virt1test:~# wget https://gist.github.com/raw/4366061/68125dcf8767e1f5436e506c2d2a9697605d9802/chef-apply.rb
--2012-12-23 13:56:32--  https://gist.github.com/raw/4366061/68125dcf8767e1f5436e506c2d2a9697605d9802/chef-apply.rb
2012-12-23 13:56:32 (137 MB/s) - `chef-apply.rb' saved [848]

root@virt1test:~# chmod +x chef-apply.rb
root@virt1test:~# ./chef-apply.rb recipe.rb
[2012-12-23T13:56:54-07:00] INFO: Run List is []
[2012-12-23T13:56:54-07:00] INFO: Run List expands to []
[2012-12-23T13:56:54-07:00] INFO: Processing package[zsh] action install ((chef-apply cookbook)::(chef-apply recipe) line 1)
[2012-12-23T13:56:54-07:00] INFO: Processing package[vim] action install ((chef-apply cookbook)::(chef-apply recipe) line 2)
[2012-12-23T13:56:54-07:00] INFO: Processing file[/tmp/stuff] action create ((chef-apply cookbook)::(chef-apply recipe) line 3)

This is the simple recipe:

package "zsh"
package "vim"

file "/tmp/stuff" do
  content "I have some stuff I'm stashing in here."
end

This functionality is quite useful for example purposes, and a ticket (CHEF-3571) was created to track its addition for core Chef.

(10) Integrating Chef with Your Tools

There's a rising ecosystem of tools surrounding Chef. Many of them use the Chef REST API to expose cool functionality and let you build your own tooling on top.

spice and ridley (ruby)

spice and ridley provide Ruby APIs that talk to Chef.

pychef (python)

pychef gives you a nice API for hitting the Chef API from Python.

jclouds (java/clojure)

jclouds has a Chef component that lets you use the Chef REST API from Java and Clojure.

(11) Sending information to various places

Chef has the ability to send output to a variety of places. By default, it will output to standard out. This is managed through the Chef logger, a class called Chef::Log.
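
The same logger is available to your own recipes and libraries; a trivial illustration (the messages here are made up):

Chef::Log.debug("verbose detail, only shown when the log level is debug")
Chef::Log.info("normal informational output")
Chef::Log.warn("something looks suspicious")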

The Chef::Log Configuration

The Chef::Log logger has three main configuration options:

  • log_level: the amount of log output to display. Default is "info", but "debug" is common.
  • log_location: where the log output should go. Default is standard out.
  • verbose_logging: whether to display "Processing:" messages for each resource Chef processes. Default is true.

The first two are configurable with command-line options, or in the configuration file. The level is the -l (small ell) option, and the location is the -L (big ell) option.

chef-client -l debug -L debug-output.log

In the configuration file, the level should be specified as a symbol (preceding colon), and the location as a string or constant (if using standard out).

log_level :info
log_location STDOUT

Or:

log_level :debug
log_location "/var/log/chef/debug-output.log"

The verbose output option is in the configuration file. To suppress "Processing" lines, set it to false.

verbose_logging false

Output Formatters

A new feature for log output introduced in Chef 10.14 is "Output Formatters". These can be set with the -F option, or the formatter configuration option. There are some formatters included in Chef:

  • base: the default
  • doc: nicely presented "documentation" type output
  • min: rspec style minimal output

For example, to use the doc style but only for one run:

chef-client -F doc -l fatal

Use the log level fatal so normal logger messages aren't displayed. To make this permanent for all runs, put it in the config file.

log_level :fatal
formatter "doc"

You can create your own formatters, too. An example of this is Andrea Campi's nyan cat formatter. You can deploy this and use it with Sean OMeara's cookbook.

Report/Exception Handlers

Chef has an API for running report/exception handlers at the end of a Chef run. These can display information about the resources that were updated, any exception that occurred, or other data about the run itself. The handlers themselves are Ruby classes that inherit from Chef::Handler, and then override the report method to perform the actual reporting work. Chef handlers can be distributed as RubyGems, or single files.
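
As a rough sketch (not from the original post; the module, class name, and message are hypothetical), a minimal report handler might look like this:

require 'chef/handler'
require 'chef/log'

module MyCompany
  class UpdatedResourceCount < Chef::Handler
    def report
      # run_status describes the Chef run that just finished
      Chef::Log.info("Chef updated #{run_status.updated_resources.length} resources")
    end
  end
end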

client.rb

Chef becomes aware of the report or exception handlers through the configuration file. For example, if I wanted to use the updated_resources handler that I wrote as a RubyGem, I would install the gem on the system, and then put the following in my /etc/chef/client.rb.

require "chef/handler/updated_resources"
report_handlers << SimpleReport::UpdatedResources.new
exception_handlers << SimpleReport::UpdatedResources.new

Then at the end of the run, the report would print out the resources that were updated.

chef_handler Cookbook

For handlers that are simply a single file, use Opscode's chef_handler cookbook. It will automatically handle putting the handlers in place on the system, and adding them to the configuration.
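
Usage looks roughly like the following sketch, reusing the handler class from above (the file name and the node['chef_handler']['handler_path'] attribute are assumptions based on that cookbook's README, so check its documentation):

include_recipe "chef_handler"

# ship the handler file from this cookbook's files/ directory
cookbook_file "#{node['chef_handler']['handler_path']}/updated_resource_count.rb" do
  source "updated_resource_count.rb"
  mode 00644
end

# register it as a report handler for subsequent runs
chef_handler "MyCompany::UpdatedResourceCount" do
  source "#{node['chef_handler']['handler_path']}/updated_resource_count.rb"
  action :enable
end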

Other Handlers

A number of Chef handlers are available from the community and many are listed on the Exception and Report Handlers page. Conventionally, authors often prepend chef-handler to their gem names to make them easier to find. Some common ones you may find useful are listed there.

(12) Tagging nodes

A feature that has existed in Chef since its initial release is "node tagging". This is simply a node attribute built in where entries can be added and removed, or queried easily.

Use cases

One can certainly use other node attributes for storing data. Since node attributes can be any JSON object type, arrays are easily available. However, "tags" have some special helpers available, and semantic uses that may make more sense than plain attributes.

Part of the idea is that tags may be added or removed, flipping the node to various states as far as the Chef Server is concerned. For example, one might only want to monitor nodes that have a certain tag, or run database migrations on a node tagged to do so.

Tags in Chef Recipes

In Chef recipes, we can search for nodes that have a particular tag. Perhaps nodes tagged "decommissioned" shouldn't be monitored.

decommissioned_nodes = search(:node, "tags:decommissioned")

The recipe DSL itself has some tag-specific helper methods, too.

Use tagged? to see if the node running Chef has a specific tag:

if tagged?("decommissioned")
  raise "Why am I running Chef if I'm decommissioned?"
end

Perhaps more usefully:

if tagged?("run_migrations")
  execute "rake db:migrate" do
    cwd "/srv/myapp/current"
  end
end

If the tags of the node need to be modified during a run, that can be done with the tag and untag methods.

tag("deployed")
log "I'm printed if the tag deployed is set." do
  only_if { tagged?("deployed") }
end

Or perhaps more usefully, untag the node after the migrations from earlier are run:

if tagged?("run_migrations")
  execute "rake db:migrate" do
    cwd "/srv/myapp/current"
    notifies :create, "ruby_block[untag-run-migrations]", :immediately
  end
end

ruby_block "untag-run-migrations" do
  block do
    untag("run_migrations")
  end
  only_if { tagged?("run_migrations") }
end

Knife Commands

There are knife commands for viewing and manipulating node tags.

View the tags of a node:

knife tag list web23.example.com
decommissioned

Add a tag to a node:

knife tag create web23.example.com powered_off
Created tags powered_off for node web23.example.com.

Remove a tag from a node:

knife tag delete web23.example.com powered_off
Deleted tags powered_off for node web23.example.com.

Conclusion

Hopefully this post contains a number of things you didn't know were available to Chef, and will be useful in your Chef environment.

December 23, 2012

Day 23 - Down and Dirty Log File Filtering with Perl

This was written by Phil Hollenback (www.hollenback.net)

Here We Go Again

Say you have a big long logfile on some server somewhere and a need to analyze that file. You want to throw away all the regular boring stuff in the logfile and just print lines that look suspicious. Further, if you find certain critical keywords, you want to flag those lines as being extra suspicious. It would also be nice if this was just one little self-contained script. How are you going to do that?

Sure, I know right now you are thinking "hey Phil just use a real tool like logstash, ok?" Unfortunately, I'm not very good at following directions, so I decided to implement this little project with my favorite tool: Perl. This post will shed some light on how I designed the script and how you could do something similar yourself.

Requirements and Tools

This whole thing was a real project I worked on in 2012. My team generated lots of ~10,000 line system install logs. We needed a way to quickly analyze those logs after each install.

This tool didn't need to be particularly fast, but it did need to be relatively self-contained. It was fine if analyzing a logfile took 30 seconds, since these logfiles were only being generated at the rate of one every few hours.

For development and deployment convenience, I decided to embed the configuration in the script. That way I didn't have to deal with updating both the script and the config files separately. I did decide however to try to do this in a modular way so I could later separate the logic and config if needed.

I had been previously playing with embedding data in my Perl scripts through use of the DATA section, so I wanted to use that approach for this script. However, that presented a problem: I knew I wanted three separate configuration sections (I will explain that later). This meant that I would have to do something to split the __DATA__ section of my file into pieces with keywords or markers.

Naturally, I found a lazy way to do this: the Inline::Files perl module. This module gives you the ability to define multiple data sections in your script, and each section can be read as a regular file. Perfect for my needs.

Reading the Logfile

The first step in this script is to read in the logfile data. As I mentioned, it's around 10,000 lines at most, not a staggering amount. This is a small enough number that you don't need to worry about memory constraints - just read the whole file into memory. With that, here's the beginning of the script:

#!/usr/bin/perl

use warnings;
use strict;
use Inline::Files;

my $DEBUG = 0;
my @LOGFILE = ();
my (@CRITICAL, @WARN, @IGNORE);
my (@CriticalList, @WarnList);

while(<>)
{
  # skip comments
  next if /^#/;
  push(@LOGFILE, split("\n"));
}

What does that all do? Well of course the first few lines set up our initial environment and load modules. Then I define a LOGFILE variable. Finally the while loop uses the magic <> operator to take input lines from either STDIN or from any file specified on the command line.

Inside the while loop the combination of split and push converts each line of input text into an array entry. At the end of this whole thing I've got an array @LOGFILE which contains the logfile data. As I mentioned, this works just fine for logfiles which aren't enormous.

Reading the Data

Next, I want to create arrays for three sorts of matches. First, I need a list of matches which we should warn about. Second, I need a list of matches that we ignore. The third list is a list of critical errors that I always want to flag no matter what.

The idea here is that the first list of matches is the broadest, and is intended to catch all possible anomalies. For example, the word 'error' is caught by the warn list.

The second list is more discriminatory, and acts to filter the first list. Any time a match is found on the warn list, it's checked against the ignore list and discarded if a match is found. Thus, the line

this is a spurious warning

is initially flagged by the warn list. However, the ignore list includes a match for 'spurious warning' so ultimately this line gets suppressed as we know it's spurious.

The final list short circuits the entire process - if any match is found on the critical list, no checking of the warn or ignore lists is done. This list is intended for only specific critical failures and nothing else. That way we can invoke a special code path for the worst problems and do things like exit with a non-zero value.

Remember that we are using Inline::Files so we can treat sections of the script as data files. Here's the end of the script that can be used to configure the run:

__DATA__
# nothing after this line is executed
__CRITICAL__
# everything in this section will go in the critical array
.*File.* is already owned by active package
__IGNORE__
# everything in this section will go in the ignore array
warning: unable to chdir
set warn=on
__WARN__
# everything in this section will go in the warn array
error
fail
warn

We can now treat CRITICAL, WARN, and IGNORE as regular files and open them for reading like so:

open CRITICAL or die $!;
while(<CRITICAL>)
{
  next if /^#/; #ignore comments
  chomp;
  push @CRITICAL, $_;
}
close CRITICAL;

Repeat for WARN and IGNORE. We now have three arrays of matches to evaluate against the logfile array.

Pruning the Logs

Now, we need to act on the log data. The simplest way to do this is with a bunch of for loops. This actually works just fine with 10,000 line logfiles and a few dozen matches. However, let's try to be slightly more clever and optimize (prematurely?). We can compile all the regexes so we don't have to evaluate them every time:

my @CompiledCritical = map { qr{$_} } @CRITICAL;
my @CompiledWarn = map { qr{$_} } @WARN;
my @CompiledIgnore = map { qr{$_} } @IGNORE;

Then we loop through all the logfile output and apply the three matches to it. We use a label called OUTER to make it easy to jump out of the loop at any time and skip further processing.

OUTER: foreach my $LogLine (@LOGFILE)
{
  # See if we found any critical errors
  foreach my $CriticalLine (@CompiledCritical)
  {
    if ($LogLine =~ /$CriticalLine/)
    {
      push @CriticalList, $LogLine;
      next OUTER;
    }
  }
  # Any warning matches?
  foreach my $WarnLine (@CompiledWarn)
  {
    if ($LogLine =~ /$WarnLine/i)
    {
      # see if suppressed by an IGNORE line
      foreach my $IgnoreLine (@CompiledIgnore)
      {
        if ($LogLine =~ /$IgnoreLine/)
        {
          # IGNORE suppresses WARN
          next OUTER;
        }
      }
      # ok, the warning was not suppressed by IGNORE
      push @WarnList, $LogLine;
      next OUTER;
    }
  }
}

Output the Results

With that, the script is essentially complete. All that remains of course is outputting the results. The simplest way to do that is to loop through the CRITICAL and WARN arrays:

if (@CriticalList)
{
  print "Critical Errors Found!\n";

  while(my $line = shift @CriticalList)
  {
    print $line . "\n";
  }
  print "\n";
}

if (@WarnList)
{
  print "Suspicious Output Found!\n";

  while(my $line = shift @WarnList)
  {
    print $line . "\n";
  }
  print "\n";
}

Assuming a logfile like this:

1 this is a warning: unable to chdir which will be suppressed
2 this is an error which will be flagged
3 set warn=on
4 this is superfluous
5 set warn=off
6 looks like File foobar is already owned by active package baz

the script outputs the following:

$ scan.pl log.txt
Critical Errors Found!
6 looks like File foobar is already owned by active package baz

Suspicious Output Found!
2 this is an error which will be flagged
5 set warn=off

The end result is exactly what we want - a concise list of problematic log lines.

Conclusion

This is a pretty simple example, but hopefully, it gives you some ideas to play with. As I said in the beginning, I realize that there are lots and lots of other more clever ways to go about this sort of log analysis. I won't claim that this is even a particularly smart way to go about things. What I can tell you is that a variation on this script solved a particular problem for me, and it solved the problem very well.

The thing I'm really trying to illustrate here is that scripting isn't that hard. If you are a sysadmin you absolutely must be comfortable with scripting. I prefer perl, but rumor has it, there are some other scripting languages out there. If you have the ability to throw together a script you can quickly and easily automate most of your daily tasks. This is not just theory. I used to review multiple 10,000 line install logfiles by hand. Now I don't have to do anything but look at the occasional line that the script flags. I freed up a couple hours a week with this approach and I encourage all of you to take the same approach if you aren't already.

Further Reading

December 22, 2012

Day 22 - Be a Fire Marshal, Not a Fire Fighter

This was written by Kate Matsudaira.

It’s the end of the year, time for contemplation ... and resolutions. As you think back on this year, were you a Fire Marshal or a Fire Fighter?

Being a fire fighter can seem rewarding; they swoop in, save the day, and play the hero. However, in a highly effective team there should be no heroes. The recognition that comes with saving the day often outweighs the silence that comes from a smooth, seamless deployment or issues resolved without customer impact.

The thing about operations and infrastructure teams is that a lot of the work isn’t noticeable unless it is done wrong. Executives pay attention and take notice when things go awry or problems, like outages, occur. Great work, streamlined processes, and reduced costs are sometimes harder to see.

Far too often, operations and devops teams enter into a death spiral of reactivity - a positive feedback loop without preventive planning. These teams fail to allocate time to proactive measures, which, in turn, leads to crisis after crisis. How does such a team, otherwise beaten down and demoralized by operational problems, find a way to reboot into a "proactive" state of being?

Whether you are a team member or the team manager, you can help pull your team of fighters into a more organized brigade, set up to put fires out before they become a problem.

When it comes to fire fighting, there are typically two things at play: the way we select problems to solve, and how we solve them.

Selecting Problems

  • Reactive teams work on every problem that comes up.
  • The project in progress is always the one with the most urgent deadline.
  • The priority of the problem is often dictated by the originator of the request, instead of the business need.

Solving Problems

  • Reactive teams give all problems to the fastest resolver, regardless of past or current assignments.
  • There are often lots of inefficiencies – too many people in meetings or involved in resolving incidents.
  • Since everything is urgent, there isn’t enough focus on the root causes, resulting in lots of patches and hacks.

In order to move from being a fire fighter to a fire marshal, it is important to devise a strategy that will address both. Here are some suggestions and recommendations to get you started:

  • All equipment should be operable by multiple people. You wouldn’t want a fire department with only one person who could operate the truck. The same goes for software. Are there any systems or tools that only one person knows how to debug or diagnose? Inevitably, these single points of failure can result in fire drills and can cause inefficiencies. Taking the time to train multiple people on each part of the system will create redundancy and improve overall team knowledge of operations, and invariably, it creates opportunities through previously unnoticed synergies.
  • Plan to plan. It is easy to get caught up into what is urgent and tactical, but proactive planning won’t happen without explicitly creating time for it. Moreover, periodically moving players between the duties of ops tactics and strategy planning helps balance the load and can provide an often much needed break from the line of fire. Draw a clear "line in the sand" between the strategy and operations; build a firewall, but fairly, so as to not prevent progress.
  • Put players on assignments that play to their strengths. It is true that learning new things and pushing people to challenge themselves is wise when it comes to long term career development. However, in a team that is more reactive and constantly dealing with issues, it may be best to give each person an assignment they will be able to knock out of the park. By giving people the work that they are best at, they can complete it quickly and hopefully have extra cycles to help with more proactive preventive assignments. Moreover, if there are players on your team with a higher tolerance to crisis, setup roles that align players with their tolerance level. Some people thrive on the reactivity, and can play the role of your "trauma surgeons," so to speak.
  • Focus on root cause. When there is a crisis it is important to focus on what will stop the damage, and get the systems back up and operating normally. In reactive situations, once things are stabilized, a complete diagnosis can be put on the back burner, with the temporary solution put together on gut feeling, resulting in more brittle and unstable software. By helping create a culture of understanding, and by training other problem solvers, people are able to take on more responsibility and solve problems properly.
  • Show signs of real change. If you are adopting a new paradigm or cultural shift, aim for shock-and-awe: do something remarkable to demonstrate that you are breaking the cycle of reactivity (e.g. take the most legacy behavior/process and either eradicate it or drastically modify it). It is usually pretty obvious when a team is under water from incidents, but it can be hard to see what has been postponed in lieu of all these urgent problems. What are the important items that need to be tackled? Make a list or add them to your backlog; simply cataloging this work will provide visibility and then progress.
  • Let go of the "unhappy's". Both figuratively and literally. Very seldom is change welcomed with open arms, especially among us engineers who tend to like patterns and routine. Try to enlist help at the top from your manager or other senior team members. Expect to be met with some resistance, and when it happens, ask questions and try to understand their concerns; you may be surprised by some good insights and ideas. Be bold; come with a plan and enact quickly. Show confidence but flexibility and willingness to listen.
  • Create a positive feedback cycle. A key part of any change is to show progress and small victories. Signify progress by planning an off-site or organizing team building exercises that otherwise would've been impossible due to constant fire-fighting. Leverage metrics like constant trend analysis reporting with statistically useful "human" measures that show progress and improvements. Just don't reward people for making more work for themselves. That's like rewarding professional fire fighters for acts of arson! Further, it sends the wrong message about the kind of work behavior that is valued.

Some of these suggestions are easier to execute in a leadership role, but it doesn’t matter what your level or job is. Taking time to step back and think about proactively addressing issues before they arise is a great way to improve your work for you and your teammates. As a bonus, in many cases it can also lead to upward advancement and promotions.

Leadership is about influence, and management is about authority; most of these suggestions can be achieved without any formal authority. That said, I would still recommend getting your manager or team lead aligned with your mission – it will make it even easier to have a strong ally.

And remember, you're training a team of Fire Marshals, not Fire Fighters. Firefighting is an ephemeral state. Instead, a constant state of vigilance and fire prevention is what you want to engender.

Further Reading

December 21, 2012

Day 21 - The Double-Hop Nightmare

This was written by Sam Cogan.

Kerberos, the authentication protocol, is not something that usually needs to be thought about - it usually just works, but when you move into the world of impersonation, delegation, and the double hop, things start to get complicated. There are many articles on the internet that explain the process to make double hop delegation work, what they don’t explain is the many pitfalls that can occur on the way that lead to you feeling like you’d rather be gouging your eyes out with spoons.

What is this double hop of which we speak? The most common example is where you have a client accessing a website and using Windows authentication which, in turn, accesses a back-end database with that user’s credentials.

If any old web server could delegate user credentials to another back-end server, we would have the potential for security issues, so in the Windows world we need to configure the system to allow this type of delegation in a secure manner. This is where the problems start.

SPN's

To be able to delegate a service, you need to create a Service Principal Name for each service. In theory, this is a relatively trivial command to run:

setspn -A <service name>/<machine name> <domain>\<account name>
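
For example (hypothetical names, assuming a web front end running under the domain account EXAMPLE\svc-web and a SQL Server back end running as EXAMPLE\svc-sql on the default port), the registrations might look like:

setspn -A HTTP/webapp.example.com EXAMPLE\svc-web
setspn -A HTTP/webapp EXAMPLE\svc-web
setspn -A MSSQLSvc/sqlserver.example.com:1433 EXAMPLE\svc-sql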

However, there are a number of possible pitfalls to be aware of:

  • You need to ensure you create an SPN for all possible names. This includes both FQDN and Netbios names, and any aliases, if you are going to access the service by something other than the machine name
  • Duplicate SPN’s cause problems - make sure you don’t have other SPN’s for the same service but a different user account name
  • If your service is running as a system account, it will usually create the SPN for you, so check before you try and create another. SQL Server, for example, does this
  • Ensure the service account that you are creating the SPN for is in the same domain as the machine it is running on
  • If the service is running on a non-default port, make sure you include the port number

Some Notes About SQL

Whilst creating SPN’s for most services is pretty straightforward, SQL can be a bit of a nightmare. To try and avoid the issues, look to follow these tips.

  • If you're going to create port based SPN’s, make sure you disable dynamic ports in SQL, else you might find delegation works fine until your first restart of SQL server.
  • If you are using a named instance, make sure you include the instance name or the port number in the SPN.
  • Unless you have a very simple setup, run SQL as a designated service account, not the system account. Whilst running as the system account will create SPN’s for you, if you have clients in other domains or forests it can cause problems.

Some Notes About IIS

If you're using a domain service account for running your application pool, you need to set the “useAppPoolCredentials” option to true, so that the application pool account is used as part of the delegation process:

  1. Open IIS Manager.
  2. Expand the server and then ‘Sites’, then select the application.
  3. Under Management, select ‘Configuration Editor’.
  4. In the ‘From:’ section above the properties, select ‘ApplicationHost.config <location path=…’.
  5. For the ‘Section:’ location, select system.webServer > security > authentication > windowsAuthentication.
  6. In the properties page, set useAppPoolCredentials to True, then click Apply.

Enabling Delegation

Once you’ve got all the SPN’s set up, it’s a simple step to enable constrained delegation in Active Directory. Microsoft explain the process quite clearly here - http://technet.microsoft.com/en-us/library/cc756940(v=ws.10).aspx - but as always, there are things that can go wrong:

  • If the service you select to allow constrained delegation to is running as a service account rather than system, make sure you select the service account when you are setting delegation up, not the machine.
  • Where there are multiples of the same service, but on different ports - like SQL, make sure you pick the right one.

Troubleshooting Tools

Ultimately, you might follow all these tips and still end up with delegation not working, which can be a very frustrating experience. There are some tools that can make diagnosing these issues easier:

  • DelegConfig - Brian Booth’s tool for debugging Kerberos delegation when using IIS. It runs you through a wizard and shows you a report telling you what is and isn’t set up correctly and whether delegation will work. It can even fix some of your issues, given the right permissions.
  • KList/Kerbtray - If you're using Server 2008 R2, the new Klist tool can be used to view Kerberos tickets and diagnose what delegation is taking place. If you're using an older OS you’re stuck with Kerbtray, which can be found in the Windows Server 2003 Resource Kit (it works on Server 2008 as well).
  • Event Viewer - Use the Security log in Event Viewer to get more detailed error messages on the delegation failures.

Final Thoughts

Whilst getting double-hop delegation working can seem a bit of an arduous process, so long as you bear in mind its very strict requirements it can become less of a dark art and more of a process to follow. Hopefully it’s not something you’ll need to do too often, but if you do, I hope these tips make things a little less painful.

Further Reading

December 20, 2012

Day 20 - Data-Driven Firewalls

This was written by Zach Leslie.

During his keynote presentation at PuppetConf 2012, Tim Bell said something about the way in which machines can be classified that stuck with me:

The service model we've been using ... splits machines into two forms. The pets: these are the guys you give nice names, you stroke them, you look after them, and when they get ill, you nurse them back to health, lovingly. The cattle: when they get ill, you shoot them.

What follows is my attempt to explain the impact those words have had on the way I think about firewall management.

Firewalls As Pets

Managing firewalls has always felt like caring for those pets. By their nature, firewalls are uniquely connected to several networks - deciding what traffic should be allowed to pass. No other role in the network can make those decisions quite like the firewall due to its strategic placement in the architecture. There is often only a single unit controlling access for a specific set of networks at any given time. As such, rules about how traffic should be handled are specific to each firewall placement or cluster.

Perhaps you manage one with a web interface, or if you are lucky, execute commands at a shell, upload data, etc. You take backups and hope nothing goes sour with the hardware, right?

Like many people, I like reusing my previous work to gain more benefit from my efforts than I had originally intended. It's the force multiplier. If I write a method that does X, I should be able to apply X in lots of places without much additional work. When dealing with a unique box, like a firewall, I've found myself stuck with very little reuse of my efforts; one unit of gain for one unit of effort. Web interfaces don't usually lend themselves to understanding why rules are in place, or why the configuration is just so. If there is a comment field for a firewall rule, you might be able to drop a link to a ticket tracker to provide some external information, but even then you might lose historical context. Certainly, configuration management alone does not gain you this context, but if you store your configurations and code in a revision control system, then you can look through the history of changes and understand the full picture, even if the lifetime spans years. Reviewable history with full context is fantastic!

Configuration management just extends a mentality that software development has had for years: libraries of reusable code save time and effort.

As I spend more time with the discipline of configuration management, I find myself wanting to treat more machines like cattle, so I can rebuild at will and know with reasonable certainty what exactly has been done, and why, on a given system to get it into its current state. If you have data that is consumed by various parts of the infrastructure and you want to make a change, you need only manipulate that data to make it so. In this way, you force yourself to consider the architecture as a whole, and apply consistency throughout, leaving you with a cleaner and more maintainable infrastructure.

Refactoring is not just for the Devs anymore

Refactoring has the benefit of allowing you to take the knowledge that you learned the first time around and apply that knowledge as you move forward. The requirements for your infrastructure change over time, and as a result, the existing implementation of infrastructure needs to follow those changes. If you have to start from scratch with every refactor, the time required is often daunting, causing the refactor work to get pushed out until it is critical that the work be done.

What does it mean to be data-driven?

For a long time, I thought that all the talk of data-driven was just for the big cloud providers or large enterprises, but data-driven infrastructure is just a mechanism to distill your configuration into reusable pieces - something we do with configuration management already. In Puppet, for example, modules are the reusable units.
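To make that concrete, a reusable unit might be nothing more than a parameterized class whose behavior is driven entirely by the data handed to it - a minimal, hypothetical sketch (the class name and parameters are invented for illustration):

# Hypothetical data-driven unit: every site-specific detail arrives as data,
# nothing is hard-coded in the module itself.
class resolv_conf_example ($nameservers = ['10.0.0.10', '10.0.0.11']) {
  file { '/etc/resolv.conf':
    ensure  => file,
    content => inline_template("<% @nameservers.each do |ns| -%>nameserver <%= ns %>\n<% end -%>"),
  }
}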

When I think of a data-driven infrastructure, I think of constructing a model for how I want the systems involved to behave. Your model can include common data as well as unique data. All systems might have network settings like address and VLAN but, for example, only some have backup schedules.

Building The Model

Let's construct the model, starting with the common elements. I've been building all of my data models in YAML since it's easy to write by hand and easy to consume with scripts. Eventually, we may need some better tooling to store and retrieve data, but for now, this works. We'll start with a network hash and tack on some VLANs.

---
network:
  vlans:
    corp:
      gateway: '10.0.200.1/24'
      vlan: 200
    eng:
      gateway: '10.0.210.1/24'
      vlan: 210
    qa:
      gateway: '10.0.220.1/24'
      vlan: 220
    ops:
      gateway: '10.0.230.1/24'
      vlan: 230

Now we can load the YAML data and take some action. I'm loading this data with hiera into a hash for Puppet to use. Putting the above YAML data into a file called network.yaml, you can load it up with a hiera call like so:

hiera('network',{},'network')

This will look for a key called 'network' in a file called 'network.yaml' with a default value of an empty hash, '{}'. Now you can access your data as just a Puppet hash.
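Once it's loaded, the result is an ordinary Puppet hash that you can walk directly. A small sketch against the VLAN data above (variable names here are just for illustration):

$network = hiera('network', {}, 'network')

# Walk the hash like any other Puppet data structure.
$corp_gateway = $network['vlans']['corp']['gateway']   # => '10.0.200.1/24'
notice("corp gateway is ${corp_gateway}")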

We use FreeBSD as our firewall platform and PF for the packet filtering. Since we use Puppet, I've compiled some defined types to create the VLAN interfaces and manage parts of FreeBSD, all of which is up on GitHub.

Our firewalls also act as our gateways for the various VLANs, so we can directly consume the data above to create the network interfaces on our firewalls.

#
# Read in all network data for variable assignment
$network  = hiera_hash('network',nil,'network')
$vlans    = $network[$location]['vlans']

# Create all the VLANs
create_resources('freebsd::vlan', $vlans)

One function that Puppet provides to inject the VLAN interface resources into the Puppet catalog is create_resources(). If the keys in the hash match exactly the parameters the defined type expects, as in the case above, this works wonders. If not, you'll need to create a wrapper that consumes the hash and breaks it into its various pieces to hand out to more specific defined types, as sketched below.
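A wrapper along those lines might look roughly like this - a hypothetical sketch, with the define name invented and freebsd::vlan assumed to accept the same keys carried in the YAML above (gateway and vlan):

# Hypothetical wrapper: consume one entry of the vlans hash, hand the
# interface work to freebsd::vlan, and peel off any extra pieces that
# belong to other, more specific defines.
define network_vlan_wrapper ($gateway, $vlan) {
  freebsd::vlan { $name:
    gateway => $gateway,
    vlan    => $vlan,
  }
  # Other resources that consume part of the same data (monitoring checks,
  # switch configuration, DNS records) could be declared here.
}

# Hash keys become resource titles, nested keys become parameters.
create_resources('network_vlan_wrapper', $vlans)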

Now let's model the network configuration of the firewall.

---
network:
  firewall:
    laggs:
      lagg0:
        mtu: 9000
        laggports:
          - 'em1'
          - 'em2'
    defaultrouter: 123.0.0.1
    gateway_enable: true
    ipv6: true
    ipv6_defaultrouter: '2001:FFFF:dead:beef:0:127:255:1'
    ipv6_gateway_enable: true
    ext_if: 'em0'
    virtual_ifs:
      - 'gif0'
      - 'lagg0'
      - 'vlan200'
      - 'vlan210'
      - 'vlan220'
      - 'vlan230'
    interfaces:
      em0:
        address: '123.0.0.2/30'
        v6address: '2001:feed:dead:beef:0:127:255:2/126'

Much of the above is FreeBSD speak, but each section is handed to different parts of the code. Here we build up the rest of the network configuration for the firewall.

$firewall    = $network[$location]['firewall']
$laggs       = $firewall['laggs']
$virtual_ifs = $firewall['virtual_ifs']
$ext_if      = $firewall['ext_if']
$interfaces  = $firewall['interfaces']

class { "freebsd::network":
  gateway_enable      => $firewall['gateway_enable'],
  defaultrouter       => $firewall['defaultrouter'],
  ipv6                => $firewall['ipv6'],
  ipv6_gateway_enable => $firewall['ipv6_gateway_enable'],
  ipv6_defaultrouter  => $firewall['ipv6_defaultrouter'],
}

Now we create the LACP bundle, ensure that the virtual VLAN interfaces are brought up on boot, and set the physical interface address properties.

create_resources('freebsd::network::lagg', $laggs)

$cloned_interfaces = inline_template("<%= @virtual_ifs.join(' ') %>")
shell_config { "cloned_interfaces":
  file  => '/etc/rc.conf',
  key   => 'cloned_interfaces',
  value => $cloned_interfaces,
}

create_resources('freebsd::network::interface', $interfaces)

The Puppet type shell_config just sets a key value pair in a specified file, which is really useful for FreeBSD systems where lots of the configuration is exactly that.
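If you are curious what such a define could look like, here is a rough sketch built on file_line from a reasonably recent puppetlabs-stdlib - an assumption on my part, not necessarily how the real shell_config is implemented:

# Sketch of a shell_config-style define: keep exactly one `key="value"`
# line in the target file, replacing any existing assignment of that key.
define shell_config_sketch ($file, $key, $value) {
  file_line { "shell_config_${name}":
    path  => $file,
    line  => "${key}=\"${value}\"",
    match => "^${key}=",
  }
}

The payoff is that rc.conf-style files, which are nothing but key="value" assignments, all become one-liners to manage.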

Now that we have the network configuration for the firewall, let's do some filtering on those interfaces. In the same spirit as before, we'll look up some data from a file and use Puppet to enforce it.

For those new to PF, tables are named lists of addresses or networks, so you can refer to the names throughout your rule set. This keeps your code much cleaner since you can just reference the table and it expands to a whole series of addresses. PF macros are similar but simpler: they are just key/value pairs, much like variables. They are useful for specifying things like $ext_if or $office_firewall that can be used all over your pf.conf. The data blob for 'pf' might look like this:

pf:
  global:
    tables:
      bogons:
        list:
          - '127.0.0.0/8'
          - '172.16.0.0/12'
          - '192.168.0.0/16'
      v6bogons:
        list:
          - 'fe80::/10'
      internal_nets:
        list:
          - '10/12'
          - '10.16/12'
      dns_servers:
        list:
          - '10.0.0.10'
          - '10.0.0.11'
          - '10.2.0.10'
          - '10.2.0.11'
      puppetmasters:
        list:
          - '10.0.0.20'
          - '10.0.0.21'
          - '10.0.0.22'
          - '10.0.0.23'
          - '10.0.0.24'
      puppetdb_servers:
        list:
          - '10.0.0.25'
  dc1:
    macros:
      dhcp1:
        value: '10.0.0.8'
      dhcp2:
        value: '10.0.0.9'
    tables:
      local_nets:
        list:
          - '10.0.0.0/24'
      remote_nets:
        list:
          - '10.2.0.0/24'
  office:
    macros:
      dhcp1:
        value: '10.2.0.8'
      dhcp2:
        value: '10.2.0.9'
    tables:
      local_nets:
        list:
          - '10.2.0.0/24'
      remote_nets:
        list:
          - '10.0.0.0/24'

While the Puppet configuration might look like this:

include pf
$pf = hiera_hash('pf',{},'pf')

$global_tables = $pf['global']['tables']
create_resources('pf::table', $global_tables)

$location_tables = $pf[$location]['tables']
create_resources('pf::table', $location_tables)

$location_macros = $pf[$location]['macros']
create_resources('pf::macro', $location_macros)

At this point, we have configured the FreeBSD gateway machine attached to a few networks. We have the PF configuration file primed with tables and macros for us to use throughout our rule set, but we aren't doing any filtering yet.

I've thrown in some data to give you an idea of the possibilities. If you don't want all ports reachable on your DNS and DHCP servers, you can use the tables and macros above to do some filtering so that only the required ports are available from other networks.
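For example, a rule that only allows DNS from the internal networks to the DNS servers could be declared with the pf::subconf define that shows up in the next section (a sketch only; the rule and order value are illustrative):

# Sketch: permit DNS from the internal networks to the DNS servers,
# using the tables defined in the 'pf' data above.
pf::subconf { 'allow_internal_dns':
  rule  => 'pass in proto { tcp, udp } from <internal_nets> to <dns_servers> port 53',
  tag   => 'dns',
  order => '20',
}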

An Example Implementation

Now that we can build firewall resources with Puppet, this opens the door for all kinds of interesting things. For example, say we want to open the firewall for all of our cloudy boxes so they can write some metrics directly to our graphite server. On each of your cloud boxes you might include the following code to export a firewall resource to be realized later.

@@pf::subconf { "graphite_access_for_${hostname}":
  rule  => "rdr pass on \$ext_if inet proto tcp from $ipaddress to (\$ext_if) port 2003 -> \$graphite_box",
  tag   => 'graphite',
  order => '32',
}

In the code that builds your graphite server, you might also add something like this to ensure that the rest of your pf.conf can make use of the macros.

pf::macro { "graphite_box": value => $ipaddress; }

Then on your firewall, you can just collect those exported rules for application.

Pf::Subconf <<| tag == 'graphite' |>>
Pf::Macro <<| |>>

Now all of your cloud boxes are able to write their graphite statistics directly to your NATed graphite box, completely dynamically.

One Caveat

This method still requires that you know the syntax of PF. It also requires that you know which variables to escape in the rule string, so they remain PF macros, and which to let Puppet interpolate. Personally, I am okay with this because I like the language of PF. Also, I don't know if the code complexity required to abstract the PF language is worth the effort. pf.conf is very picky about the order of rules in the file, which complicates the issue even more.
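To make the escaping point concrete, here is the distinction in a single (illustrative) rule string:

# In a double-quoted Puppet string, \$ext_if survives as a literal PF macro
# in pf.conf, while $ipaddress is interpolated by Puppet at compile time.
$rule = "pass in on \$ext_if proto tcp from $ipaddress to any port 443"
# ends up in pf.conf as, e.g.: pass in on $ext_if proto tcp from 10.0.0.42 to any port 443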

Conclusion

I think my next steps will be to create some helper defines that will know how to work with the order of PF. Rather than have just one pf::subconf, perhaps there will be many, like pf::filter, pf::redirect, etc. Also, I'd like my switches to be consuming the same data so that I can ensure consistency across all of the devices involved in the network.

In talking to people about these concepts, I have come to think that this is not yet a solved problem when dealing with configuration management. What I have talked about above relates to an example hardware firewall, though I believe the problems it attempts to solve plague host firewalls as well as virtual packet filtering and security zones in virtual networks.

All of the code for this experiment is on GitHub. I'd love to hear how others are solving these kinds of issues if you have some experience and wisdom to share.

Happy filtering.

Further Reading

December 19, 2012

Day 19 - Modeling Deployments on Legos

This was written by Sascha Bates.
When deployment scripts grow organically, you typically end up with a brittle, poorly documented suite understandable only by the original authors or people who have been working with them for years. The suite often contains repetitive logic in different files, exacerbated by code offering little in the way of documentation or understanding. There are probably several sections commented out, possibly by people who no longer work there, as well as several backup files with extensions like .bk, .old, .12-20-2004, .david, .david-oldtest.

This code has probably never been threatened with version control.  It's easy for bugs to lurk, waiting for just the right edge case to destroy your deployment and ruin your night. Deployments are known to be buggy and incident-prone. The natural result of this situation is that all deployments require an enormous spreadsheeted playbook requiring review by a large committee, a monster conference bridge with all possible players and problem solvers online "just in case" and probably overnight deployments in order to reduce impact.  Regardless of root cause, the Ops team probably receives the brunt of abuse and may be considered "not very bright" due to their inability to smooth out deployments.

When you are mired in a situation like this, it's easy to despair. Realistically speaking, it's probably not just the deployment scripts. Deployments and configurations probably vary wildly between environments with very little automation in place. The company probably has canyon-sized cultural divides and a passion for silos.  You've already had the sales pitch on configuration management and continuous integration testing and, while they are critical to systems stability, I'm not here to talk about them. I'm here to talk about Legos!
"LEGOS: interlocking plastic bricks and an accompanying array of gears, mini figures and various other parts. Lego bricks can be assembled and connected in many ways, to construct such objects as vehicles, buildings, and even working robots. Anything constructed can then be taken apart again, and the pieces used to make other objects."
Instead of looking at your deployment as just a linear progression, consider it a collection of interchangeable actions and states. A way to address the quagmire of scripts is to replace them inch by inch with a framework.  And it’s not so much what you use to create the framework, but how you ensure flexibility and extensibility.  Your deployment framework should be like a Lego kit: a collection of interlocking building blocks that can build something yet be disassembled to build something else.

Building blocks, not buildings: A good deployment system should be composed of a framework of flexible, unambiguous code snippets (bricks). These could be Bash, Ruby, Puppet, or Chef blocks, or components from a commercial pipeline tool. Configuration management and pipeline tools have a built-in advantage of already providing this logic and idempotence.

A brick should do one thing: manage services, copy a file, delete a directory, drain sessions from an application instance. Bricks should connect at action points based on external logic, which brings me to my next point, separation of duties.
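Before moving on to that, it is worth seeing just how small a brick can be. Here is a hypothetical Puppet example (the class name is invented) whose only job is to stop one service:

# Hypothetical brick: its single job is to stop one service on this node.
# Deciding when to run it, and what to verify afterwards, belongs to the
# orchestrator, which keeps the brick small and reusable.
class deploy_bricks::stop_tomcat {
  service { 'tomcat':
    ensure => stopped,
  }
}

The tool does not matter; the point is that the brick does exactly one unambiguous thing and can be recombined freely.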

Actors and Actions and a Central Authority: In order to keep deployment logic from bloating our bricks, orchestrators and actors should be separate. I think this is the most difficult requirement to achieve, which is why we end up with bloated, brittle deployment scripts and 90 line spreadsheeted playbooks.  Deployments are not just moving files around and restarting Java containers. Deployments can contain database updates, configuration file changes, load balancer management, complex clustering, queue managers, and so much more. Start and stop order of components can be critical and verification of state and functionality must often happen prior to continuing.

For example, I may need to set Apache to stop sending traffic to my A cluster, and I need to verify there are no active sessions in the A cluster application instances before continuing. If I can make a decision on the state something should be in, I should not also be the executor. You should have a central command and control that understands the desired state of all components during a deployment and makes decisions based on the state of the entire system. There are tools for this. Capistrano is famous for its ability to do much of this. Rundeck can manage multiple levels of orchestration, and there are several commercial tools.

Visible, Understandable Flow: Your tool should be able to display workflow in such a way that it doesn’t take a genius to understand what they’re looking at. While the language or implementation you are using should not matter, your implementation style should. You should not have to be an expert in the app to understand the process flow when looking at a diagram - i.e., the flow should be obvious from looking at the tool. This is a place where I find Jenkins and other traditional build tools really fall short.


If you install enough plugins into Jenkins, it will twist itself around and try to do anything you want. The trouble with that is it becomes impossible to follow the pipeline and decision-making process. Jenkins is good for pushing a button to start a job, but it generally has no view into your application servers or web servers. If you set it up to make orchestration decisions, it's easy to get lost in the pipeline. I've often met people who want to kick off remote shell scripts with Jenkins as part of a deployment, and I continue to object to this because you've pretty much just launched a balloon off a cliff. You don't know what's happening on the other end, and Jenkins will never know unless that node has a Jenkins agent.

At this point you've removed the central management and decision making process from your deployment orchestration.  I will continue to maintain that Jenkins makes a fantastic build and continuous testing server, but is not meant for more comprehensive orchestration.

Communicative Metrics and Alerts: Dashboards are critical for communicating to stakeholders. Stakeholders are the appdev team, operations, other teams impacted by deployments, managers, business owners, and your teammates. The more you give people pretty pictures to look at, the less likely it is they’ll make you sit on an enormous conference bridge at deployment time.



Your system should have the ability to collect metrics, trigger alerts, and display dashboards. I don’t necessarily mean that the tools should come with built-in dashboarding and reporting, although if you paid for it, it should. If it doesn’t have built-in dashboards, it should have pluggable options for metrics collection and alerting to something like Nagios, which will allow you to design some dashboards.
 

Test Execution: Once you have your reporting and metrics collection in place, the first thing you should be looking at is execution of tests and metrics collection on pass/fail status. The deployment framework should be able to incorporate test results into its reporting. 

Pretty and Easy to Use: The deployment framework should be compelling. People should want to use it. They should be excited to use it. You need a simple UI that is pretty, with big candy-colored buttons. It should be communicative, with real-time updates visible. This will often be one of the last things you achieve, but I consider it critical because it helps move deployments out from under just the person(s) who designed the process to the entire teams who need to use it.



Source Control and Configuration Management: Of course, with all of this, your framework should be checked into source control. You should comment it heavily. Do I really need to say more about that?  You should deploy to systems already under configuration management. If you are deploying to systems still configured by hand, your deployment success will be at risk until you standardize and automate. 


You're Not In This Alone: It's hard work sometimes, living in chaos. You're not alone. Many of us have been there and the reason I can write about this so much and have so many opinions is because I've seen it done wrong throughout my career. Either because of time and people constraints, lack of concern for the future (we'll scale when we get there) or ignorance, many of us have encountered and even perpetuated less than optimal scripting suites and spent miserable hours on deployment firefighting bridges. 

Open source tooling and communities are all over the internet. Get an IRC client and pop into one of the freenode chat rooms for #puppet, #chef, #capistrano, ##infra-talk and ask questions. People are friendly and often want to help.  Find your local DevOps or tooling meetup and go talk to people.

Existing Tooling: And finally, you don't have to roll your own. Like Legos, there are many custom kits and how-tos for existing kits. There's no reason not to build on the shoulders of giants instead of starting from scratch. Here are some examples of open source tools specifically for deployment or general orchestration.