Response to Chef in 2012

Blog Response

This is a response to this blog article: Chef in 2012: http://devopsanywhere.blogspot.com/2012/01/chef-in-2012.html

This is a great article touching on some subjects I’ve wanted to blog about. As a sysadmin who has moved into ops, two important things struck me from the post:

Breaking down the barriers for sysadmin adoption
Dangers to chef in 2012

This post reflects my personal opinion and my experience with chef across my previous and current jobs, during which I moved from bare metal (and vmware) to cloud only. As noted in the background below, I am an infrastructure guy still new to ruby and still new to the cloud, so there is a good chance there are simple ways to address some of the issues I have had with chef. Chef is the best new tool I have used in a very, very long time. For such a new tool, the level of sophistication in chef continues to amaze me. I am always looking to learn more, and you learn quickly from your mistakes, so love it or hate it, if you have any input please comment on the post or hit me up on twitter.

Background

I’m an infrastructure guy through and through but I’ve always loved writing code. I think most sysadmins do, but it’s the opposite of DevOps; it’s more like a SysDev. The two seem to exist because of a different focus. DevOps people love chef because they manage their infrastructure through code, something they already know, and that makes their lives easier. In Ops, if you can find a tool (code) you love that eliminates your need to write any code, that is great because it saves you time and makes your life easier. Where the two meet is that chef could definitely be that tool.

About 6 months ago I started learning ruby and along with that chef. I moved to a new job, which is a Rails shop, but my primary reason for learning ruby was chef. Previously I was used to using python or vbs (ugh) for automation.

Testing

In the article Bryan talks about there being no broad consensus on how to automate the testing of chef cookbooks. I definitely agree on this point. In the past I had never done any test driven infrastructure work. Our tests were always merely monitoring after setup. Test driven infrastructure is somewhat of a foreign topic for sysadmins; you set up the software and keep it running by monitoring it. Making sure the software is working right is typically a vendor problem and a help ticket or a bit of hacking away. Test driven development and behaviour driven development are great, but some additional explanation of what these mean for sysadmins would be welcome.

In a more traditional sense, testing your cookbooks can also be troublesome. Chef has environments, which are great for different role definitions between, say, staging and production, but what do you do if you have an existing cookbook (for example nginx) to which you are making a large change (recompiling)? A bit of research led me to keeping version numbers on cookbooks and pinning the versions applied to your infrastructure. This also seems to go against community practice: if you are doing a pull request, more people seem to say do not update the version number. I’ve found it easier to create test cookbooks (test_nginx vs nginx), apply them to the nodes I want to test with through a new role (test_app vs app), and remove the old role. This generally means a lot of shifting around of role recipes and merging test recipes back into the main ones, but it works. There is probably an easy way to do this that I am not aware of; it’s probably a one liner in shef.
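For the version pinning approach mentioned above, here is a minimal sketch of what pinned environments could look like. It builds the JSON form of an environment with a `cookbook_versions` entry; the helper name `environment_json`, the descriptions and the version numbers 1.0.0 and 1.1.0 are all made up for illustration.

```ruby
require 'json'

# Build the JSON for a Chef environment that pins cookbook versions.
# `pins` maps cookbook name => version constraint string, e.g. "= 1.1.0".
# In the Ruby environment DSL the same pin is written: cookbook "nginx", "= 1.1.0"
def environment_json(name, description, pins)
  JSON.pretty_generate({
    "name" => name,
    "description" => description,
    "cookbook_versions" => pins,
    "json_class" => "Chef::Environment",
    "chef_type" => "environment",
    "default_attributes" => {},
    "override_attributes" => {}
  })
end

# Hypothetical versions: staging tries the recompiled nginx, production stays put.
staging    = environment_json("staging", "nginx 1.1.0 under test", { "nginx" => "= 1.1.0" })
production = environment_json("production", "known good nginx", { "nginx" => "= 1.0.0" })
puts staging
```

Saved to a file such as staging.json, this could be uploaded with `knife environment from file staging.json`. Nodes in staging would then pick up the new cookbook version while production stays on the old one, without needing a separate test_nginx cookbook.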

Foodcritic is a great step in the right direction, but I feel there could be more infrastructure focused testing, much as Bryan talks about in his article. What goes hand in hand with this is best practice. I feel this is where chef should go against the grain of ruby/rails and provide a more distinct “this is how you do X,” or lead through example, as a way to help others. By others I mean sysadmins :)

It really feels horrible to do a lot of testing and still have some uncertainty about cookbook pushes. If you’ve got a cookbook that’s creating a lot of files you need to easily set it up to revert if you need to. Conf.d directories are a great example of where you can be bitten if you aren’t aware that removing or changing the cookbook isn’t going to remove the file.
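One way to avoid being bitten by leftover conf.d files is to compare what is on disk with what the cookbook still manages. This is a rough sketch in plain ruby, not chef code; the function name `stale_conf_files` and the `managed` list are made up for illustration.

```ruby
# Sketch only: list files in a conf.d-style directory that no recipe manages any more.
# `managed` is a hypothetical list of file names the current cookbook still renders.
def stale_conf_files(conf_dir, managed)
  names = Dir.glob(File.join(conf_dir, "*.conf")).map { |path| File.basename(path) }
  names.reject { |name| managed.include?(name) }.sort
end
```

In a recipe, each name returned could be handed to a `file` resource with `action :delete`, so that removing a template from the cookbook also removes the file from the node.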

OS simplicity/compatibility

Bryan also talks about how things would be easier if ruby 1.9.* was included in OS distributions, much as perl/python already are. Personally I have not run into any issue with this but have read a lot from people wishing for it. In the sysadmin world, if you are using a tool you are normally going to install an agent or some prerequisites, so this is not really news to me. It almost feels “normal” to need to do a few prerequisite installs before setting up some software. Many software pieces require the .net framework, redistributables, tomcat or another piece, and generally I’ve found ruby to be pretty easy to get going, although it can take a bit to compile.

The real issue I have had is with multiple ruby versions and managing these along with gems. This is not something you generally see in infrastructure: a software piece that runs on top of a prerequisite which you manage with that same software. In some ways this feels like poor design, since the software can kill itself through a prerequisite; however, there doesn’t seem to be any logical way around this and it hasn’t caused me any strife yet. Standards for managing ruby and/or an official opscode cookbook for this would be a big help. This is likely an edge case, as you may not need to run multiple ruby versions unless you are running multiple apps or rails, but it is an issue I did have. Getting the proper gems across the OS and other versions of ruby could use some clarification.

On another note, chef on the Windows side really seems to be gaining some momentum with the latest updates from Opscode, and I feel like this may be chef’s major breakthrough into the sysadmin community. Having one tool to manage your entire infrastructure is monumentally important, and they seem to be making great strides in windows support. I haven’t actually used the windows chef setup, but coming from a primarily windows shop, a tool like this could have a big impact. There are certainly some challenges in competing with the existing tools sysadmins are using today.

Getting what you need

Bryan talks about his Other Miscellaneous Technical Needs, but I feel like this is a major point with chef. Bryan talks about his needs with java and how chef “needs some love” in that area, and I feel like this is the case across a drastically large area of infrastructure. The cookbooks you find from opscode or from other more popular cookbook repos on github seem to be very centered on specific needs. This is fair; people wrote these to meet specific needs. But if you are building something specific you need to be prepared to write your own cookbooks or do some major hacking on existing ones. Occasionally you’ll find a book that is straight out of the box, drop in and go, but for the most part writing your own can actually save you time compared with hacking someone else’s. Compared to typical open source use in infrastructure this varies. There is no vendor or support for the cookbook you jacked from someone’s github repo. If you need a change to the book expect to do it yourself, and even if you do, don’t expect it to be accepted by default into the main repository. This seems to generate a lot of cookbooks that accomplish the same things rather than a shared effort towards a central goal. This is a great barrier to entry into the enterprise, but at the same time it generates a level of creativity and supports what I feel is the case across the ruby/rails community. If you need chef in the enterprise you now need an expert, and experts are somewhat uncommon, or labeled too new, and from a business perspective, expensive. Dropping a million dollar+ system on to something that’s “hacky” with no support is a tough sell to the dinosaurs that run a lot of enterprises. Then again, most enterprises are only now moving into hybrid cloud, so perhaps the breakthrough here is introducing new technologies and practices during cloud adoption.

Doing it yourself

Having a need and being motivated to move out of the old days (configure it manually) and into the new age (configuration management) gets you far in a lot of fields, and the same goes for chef. Like other tools, chef is an investment, and the more infrastructure you have the better the investment is. Deciding to go with configuration management is a great idea at any size of organization; 5 servers or 5000, it is going to help you out. Moving away from traditional methodologies to chef is the harder part, and a big part of it runs in parallel with also moving into the cloud.

With a little bit of ruby experience you can easily get going on some cookbooks by looking at others and using the ruby DSL. After server setup, a look at a few examples and the chef wiki will have you up and running with some basic cookbooks applied in a few minutes’ time. If you only need to modify a few files or install a crontab, the basics will get you far.

Where things start to get more challenging is when you look at applying your existing knowledge in to this environment. Things you may hold dear to you will quickly go out the window. An example of this would be data bags. Data bags are a fantastic way to have centralized searchable data accessible to your cookbooks, but anyone that has access to chef also has access to this data. Some popular cookbooks store important information in data bags or in node attributes. Having your mysql root username and password accessible to anyone that has access to the chef server is a real bummer. Granted, all you need to do is change the cookbook to further secure your data but it seems to be general practice to store usernames and passwords in node or data bag data.

Popular cookbooks such as nagios don’t follow the existing documentation for setup and configuration. For nagios I always install from source; it’s just what I am used to. After a bunch of hacking away at the existing cookbook it’s clear that it doesn’t work with the latest nagios version. Following the cookbook along with the install guide, it’s clear something is out of sync. Writing your own nagios installation cookbook might be a bit challenging as a first effort, and this may be a big enough barrier to entry for some people to give up on the effort. It seems like the right thing to do would be to just fix the existing cookbook and share it with the community, but being new to ruby and chef this is challenging in itself.

The opscode wiki itself can be fairly confusing but has improved over the past few months, and the community support seems to be better than ever. You will likely need to refer to the documentation a lot as you begin, and I have found this does not change as I continue to learn more about chef. There is little distinction in the documentation between what is chef specific and what is ruby code. Maybe this is just obvious to most. If you aren’t an expert on ruby, you may have to take some time to familiarize yourself with basic methods such as File.exists? that are used within chef resources. The examples given on the wiki seem to omit some basic tasks, I am guessing because they are obvious to a ruby developer and merit no display. Missing example: creating a directory if it doesn’t exist.
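For what it’s worth, here is roughly what that missing example could look like in plain ruby. The helper name `ensure_directory` and the path in the comment are made up; the chef way of doing the same thing is the built-in `directory` resource, shown in the comment.

```ruby
require 'fileutils'

# Create `path` (and any missing parent directories) only when it is absent.
# Returns true if the directory was created, false if it already existed.
#
# In a recipe the equivalent is the built-in directory resource, for example
# (hypothetical path):
#   directory "/var/www/app" do
#     recursive true
#     action :create
#   end
def ensure_directory(path, mode = 0755)
  return false if File.directory?(path)
  FileUtils.mkdir_p(path, mode: mode)
  true
end
```

The plain ruby check runs when the recipe is compiled, while the `directory` resource is idempotent on its own and runs with the other resources, so inside a cookbook the resource is usually the better fit.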

Server Setup / Web Interface / CLI Tools

Generally I found the install process for chef-server to be really good. There were no hidden configuration changes that needed to be made, and with the variety of backend products chef-server uses it almost feels like magic that they all work with the install. I considered using the hosted chef offering, which is very appealing, but at a startup with any recurring cost involved I always ask myself “Do I have to have this to get what I need?” In this case I can set it up for free, and it is amazing that software this good comes for free! The only big issue I ran into with the server setup was file permissions on the chef solr instance. The pid file and log file permissions seem to be changed on boot up of solr, and the instance itself wasn’t able to boot because of it. This would cause chef to look like it was running but any searching would fail. A quick file permissions fix and all was well until the next reboot. I don’t think I ever fully resolved the issue and it is still on my list of todos. I haven’t ruled it out as an OS problem yet.

The web interface for chef is great for reviewing information or for some visualization of your setup. It’s also great for some easy manipulation of recipes assigned to roles, or for duplicating roles between staging and production environments. Most editing I do is through the knife interface, but that is also partly due to the web interface being buggy or not having a clear way to edit json data. I found the knife client setup confusing, and I had to reference multiple wiki pages to get set up correctly. I think a big part of this was inconsistent naming across the client interface, the wiki and the server itself. Simple things like multiple names for items like validation.pem make something that’s actually very straightforward far more confusing. Some better visualisation on a wiki page, file path documentation, and some wording changes would probably clear this up a great deal.

The knife client itself is amazing and I feel it is the way to harness the full power of chef. Being able to edit nearly every setting via a cli and through json feels like someone finally got it right. Being able to import configuration directly from JSON makes me feel like I live in the future. It’s the small things, right? The only major issue I ran into with knife was with my json edits. The error messages when saving data back to the server, such as a data bag, could be friendlier. A lot of the time you get what appear to be server side errors, and while this is typically due to invalid json, the error message leads you to believe it’s a server issue. An example of this was when I was importing a coworker’s ssh key into a data bag. I had left an extra line break before the closing bracket of the JSON, which produced an odd error, and I wasn’t able to save the data. Looking at the JSON all looked ok and at the time I thought it was a bug. A friendlier invalid json or json parse error would have been a lot more helpful and led me more quickly to the issue. I’m not versed well enough in chef internals to know how feasible a change like this is.
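Until the error messages improve, a local check before uploading can catch this kind of mistake. This is a minimal sketch using ruby’s standard json library; the function name `json_error` is made up and this is not part of knife.

```ruby
require 'json'

# Returns nil when the text parses as JSON, otherwise a short readable message.
def json_error(text)
  JSON.parse(text)
  nil
rescue JSON::ParserError => e
  "invalid JSON: #{e.message}"
end

# Example with a hypothetical data bag item that is missing its closing brace.
broken = '{"id": "alice", "ssh_key": "ssh-rsa AAAA..."'
problem = json_error(broken)
puts problem if problem
```

Running something like this over a file before `knife data bag from file` at least tells you the problem is on your side and not on the server.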

I like to relate this architecture to something like a netapp device. The two are fundamentally quite different, but when I think about a typical newbie question of “Just starting chef, should I use the web interface or knife?” the answer seems to be use whatever works for you, and the same goes for netapp. Generally I find that this is actually the best explanation, but more guidance here would break down some of the barrier to entry. With Netapp the web interface can serve all the basic functions, but if you really want to get your hands dirty and use some of the more advanced functions you must use the cli. I would bet a large portion of smaller netapp installs are managed by the web interface only; in some cases you may never need anything more. This makes the barrier to entry for a netapp very low for an advanced tool, while admins that want to immediately jump deep into it have the advantage of a straight cli with all of the power. I would actually like the chef web interface more if some of the functionality was stripped; focusing on core functions such as a JSON editor, role updates and the ability to remotely run chef client would really break down some of the barriers to adoption for the lesser skilled sysadmin. Push the web interface but always reference the cli.

Documentation/Community/Support

Let’s not kid ourselves here, chef is a very advanced tool. Covering documentation across many different operating systems and thousands of popular software pieces/configurations is a huge undertaking. For what is available now, the existing documentation on the wiki will get you pretty far, but you might have a bit of frustration/head scratching. Chef’s advantage, however, is its blazingly brilliant community and the staff at opscode. For support I prefer IRC, if it is available, over help tickets/knowledgebase/etc. A lot of places offer IRC support channels, but a lot of them seem to be full of tumbleweeds. Chef’s IRC channel is great; I’d say 75% of the questions I had were answered nearly immediately in the channel, and most of the time by an opscode employee. It seems like someone is always available in the channel and, more importantly, willing to help. In the current state the immediate access to information is fantastic. I remember one of the first questions I had was how do I rename a node? Simple, right? As you know or can imagine, web searching for chef can give you mixed results, but being able to jump into IRC after not finding an answer and learn that renaming a node simply doesn’t work really saved me time, and it actually gave me confidence that it wasn’t just me struggling with the software.

The community around chef seems very small when you first start learning about it, but you learn very quickly that it is pretty large and the people using it are very passionate, not only about the product but about the fundamentals behind why everyone should use it. A month or two ago New Relic released their server monitoring software, and for free! This I felt was a great release and it significantly solidified my opinion that New Relic is hands down the leading app monitoring solution. The morning after it was released I started thinking about writing a cookbook for it. I was right in the middle of a hosting provider switch so time was stretched, but it is much better to implement server monitoring before go live. A few minutes in I thought to take a look around and see if there were any cookbooks available. Sure enough, a few hours after release there was a public cookbook available for newrelic-sysmond. To me this was amazing and really showed me the power of the community. In the enterprise software field you’d be lucky to find support for new software a few months after release, let alone a few hours. I emailed the creator to say thanks, forked the repo and proceeded to implement the cookbook in my environment, where it is still running today. A few weeks later I got a github notification from someone else who had created a more refined newrelic-sysmond cookbook and had messaged several people with repositories asking to consolidate the work. Surprised that I was notified, as it wasn’t my repo, I took a closer look: the original author had removed their cookbook or made it private, so it seemed my fork was the only working version left around. If that author reads this post, please put your cookbooks back up :)

Leveraging the power of the community is definitely up to opscode but given their commitment to this it looks to be a very bright future for chef in 2012. The recent big wiki/community site updates are already refreshing to see.

Changing Your Mind

Another barrier to entry, I feel, is the hacker/developer mindset versus the sysadmin mindset. I say versus because the goals are very different. After getting all my nodes set up I found one to be repeatedly failing and not checking in for X hours. Being a sysadmin, when something goes wrong I want to know about it when it happens. The solution to this seems to be to use the chef exception handler. Sending an email when a node run fails should be easy, right? For a developer this is probably a few minutes’ work, and there is already a large range of handler plugins for chef, but for a sysadmin this is more confusing. A more common setup would be putting your smtp details into the server and having it send you notifications, apply filters, etc. In some ways the chef approach is a better setup, with the client notifying you directly that it has an issue through typical exception handling; however, in edge cases you may find yourself without notification due to firewall rules, missing gems, or a broken configuration. These are also likely to be the failures you most want to know about. Using a gem in an exception handler means that if you are missing that gem you’ve got another exception. Since you are configuring all of your nodes with chef, your configuration should already be standardized, so this definitely minimizes the majority of issues. A developer can just look at the code, drop in some changes and all is well, but us sysadmins might need an easier route, or more documentation about how to properly configure this. The wiki shows examples of how to do this, but omits some of the big picture on how to implement it. From a developer perspective perhaps this doesn’t need any further explanation, but for me, after reading the wiki page, I am still thinking to myself, “Ok, so how do I do this?” Another tutorial linked from the page hits a 404. This makes me feel this is a lesser traveled wiki page, and maybe no additional info is needed for people more savvy than me.
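To give a feel for the shape of such a handler, here is a rough sketch in plain ruby. It is not a working chef handler: a real one subclasses `Chef::Handler`, implements `report`, reads the `run_status` chef gives it, and is registered through `exception_handlers` in client.rb. The `RunStatus` struct, the `FailureNotifier` class and the delivery block below are stand-ins made up so the example runs without chef or a mail server.

```ruby
# Stand-in for the run status object chef hands to a handler; hypothetical and
# much smaller than the real thing.
RunStatus = Struct.new(:node_name, :exception) do
  def failed?
    !exception.nil?
  end
end

# Builds a failure notice and passes it to whatever delivery block was supplied
# (smtp, a chat hook, a log file). The block is an assumption of this sketch.
class FailureNotifier
  def initialize(to, &deliver)
    @to = to
    @deliver = deliver
  end

  # Returns true when a notice was delivered, false otherwise.
  def report(run_status)
    return false unless run_status.failed?
    subject = "chef run failed on #{run_status.node_name}"
    body = "#{run_status.exception.class}: #{run_status.exception.message}"
    @deliver.call(@to, subject, body)
    true
  rescue StandardError => e
    # A broken notifier must not raise a second exception on top of the first.
    warn "could not send failure notice: #{e.message}"
    false
  end
end
```

The rescue covers the edge case described above: if delivery fails because of a firewall rule or a missing gem, the handler reports it locally and does not blow up itself. A server side check for nodes that have not checked in for X hours would still be needed to catch the runs that never get this far.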

The ability to look into a codebase and quickly understand the fundamentals, or at least enough to start hacking away, is generally a skill sysadmins don’t possess, especially if they are new to the language like me. Extra explanation for the less dev inclined can ease the headache and could avoid “This tool just isn’t going to work for me.” Perhaps the less savvy sysadmin looking at this technology is already a low enough percentage that this is a moot point.

Along with the look into the codebase, a lot of the chef recipes do not push infrastructure best practice. This is also mentioned above in the Doing it yourself section, but it’s important to mention it here as it is about focus. An example of this would be redis. The majority of cookbooks do not seem to account for redundancy or a larger distributed architecture, or in smaller cases a password. Why? These probably are not needed in most cases. Providing ease of access to complex configurations is a HUGE advantage chef has over other products, but also a very large undertaking. Some software pieces have thousands of configuration options and providing easy access to them all is probably not realistic. Opscode’s ability to harness the community and build exceptionally great cookbooks, trumping all existing community cookbooks, could be a big key to their success in 2012. My feeling is that I would definitely prefer drastic improvements to existing cookbooks over a more diverse set of cookbooks. Not to be misunderstood here, the existing cookbooks are good, but it’s not farfetched to think a new community cookbook could come out that trumps them. The opscode provided cookbooks must become the standard and must hands down be the best available across the community.