Infrastructure Automation and the Cloud

As I write this, I’m sitting in a half-empty office in London. It’s half empty, you see, because it’s snowing outside, and when it snows in London, chaos ensues. Public transport grinds to a complete halt, buses just stop, and the drivers head for the nearest pub/cafe. The underground system, which you would think would be largely unaffected by snow, what with it being underground, simply stops running. The overground train service has enough trouble running when it’s sunny, let alone when it’s snowing. And of course most people know this, so whenever there’s a risk of snow, many people simply stay at home, hence the half-empty office I find myself in.

Snow in Berlin – where for some strange reason, the whole city doesn’t grind to a standstill

But why does London grind to such a standstill? Many northern European cities, as well as American ones, experience far worse conditions and yet life still runs fairly normally. Well, one reason for London’s regular winter shutdown is the infrastructure (you can see where I’m going with this, right?). The infrastructure in London is old and creaking, and in desperate need of some improvement. The problem is, it’s very hard to improve the existing infrastructure without causing a large amount of disruption, thus causing a great deal of inconvenience for the people who need to use it. The same can often be said about improving IT infrastructure.

A Date With Opscode

Last night I went along to one of the excellent London Continuous Delivery Meetups (organised by Matthew Skelton at thetrainline.com), which this month was all about Infrastructure Automation using Chef. Andy from Opscode gave us a demo of how to use Chef as part of a continuous delivery pipeline, which automatically provisioned an AWS vm to deploy to for testing. It all sounded fantastic: it’s exactly what many people are doing these days, it uses all the best tools, techniques and ideas from the world of continuous delivery, and of course, it didn’t work. There was a problem with the AWS web interface so we couldn’t actually see what was going on. In fact it looked like it wasn’t working at all. Anyway, aside from that slight misfortune, it was all very good indeed. The only problem is that it’s all a bit utopian. It would be great if we could all work on greenfield projects, or start rewriting everything from scratch, but in the real world, we often have legacy systems (and politics) which represent big blockers on the path to utopia. I compare this to the situation with London’s infrastructure – it’s about as “legacy” as you can possibly get, and the politics involved in upgrading it are obvious every time you pick up a newspaper.

In my line of work I’ve often come across the situation where new infrastructure was required – new build environments, new test servers, new production environments and disaster recovery. In some cases this has been greenfield, but in most cases it came with the additional baggage of an existing legacy system. I generally propose one or more of the following:

  1. Build a new system alongside the old one, test it, and then swap it over.
  2. Take the old system out of commission for a period of time, upgrade it, and put it back online.
  3. Live with the old system, and just implement a new system for all projects going forward.

Then comes the politics. Sometimes there are reasons (budget, for instance) that prevent us from building out our own new system alongside the old one, so we’re forced into option 2 (by far the least favorable option because it causes the most disruption).

The biggest challenge is almost always the Infrastructure Automation. Not from a technical perspective, but from a political point of view. It’s widely regarded as perfectly sensible to automate builds and deployments of applications, but for some reason, manually building, deploying and managing infrastructure is still widely tolerated! The first step away from this is to convince “management” that Infrastructure Automation is a necessity:

  • Explain that if you don’t allow devs to log on to the live server to change the app code, then why is it acceptable to allow ops to go onto servers and change settings?
  • Highlight the risk of human error when manually configuring servers
  • Do some timings – how long does it take to manually build your infrastructure – from provisioning to handover (including any wait times for approval etc)? Compare this to how quick an automated system would be.

Once you’ve managed to convince your business that Infrastructure Automation is not just sensible, but a must-have, then it’s time for the easy part – actually doing it. As Andy was able to demonstrate (eventually), it’s all pretty straightforward.

Recently I’ve been using the cloud offerings from Amazon as a sort of stop-gap – moving the legacy systems to AWS, upgrading the original infrastructure by implementing continuous delivery and automating the infrastructure, and then moving the system back onto the upgraded (now fully automated and virtualised) system. This solution seems to fit a lot more comfortably with management who feel they’ve already spent enough of their budget on hardware and environments, and are loath to see the existing system go to waste (no matter how useless it is). By temporarily moving to AWS, upgrading the old kit and processes, and then swapping back, we’re ticking most people’s boxes and keeping everyone happy.

Cloud Hosting vs Build-it-Yourself

Cloud hosting solutions such as those offered by Amazon, Rackspace and Azure have certainly grown in popularity over the last few years, and in 2012 I saw more companies using AWS than I had ever seen before. What’s interesting for me is the way that people are using cloud hosting solutions: I am quite surprised to see so many companies totally outsourcing their test and production environments to the cloud. Here’s why.

I’ve looked into the cost of creating “permanent” test labs in the cloud (with AWS and Rackspace) and the figures simply don’t add up for me. Building my own vm farm seems to make far more sense both practically and economically. Here are some figures:

3 Windows vms (2 web servers, 1 SQL Server), minimum spec dual core with 4GB RAM:

Amazon:

  • 2x Windows “Large” instance
  • 1x Windows “Large” instance with SQL Server
  • Total: £432 ($693.20) per month

Rackspace:

  • 3x 4GB dual core = £455
  • 1x SQL Server = £0
  • Total: £455 per month

These figures assume a full 730 hours of service a month. With some very smart time and vm management you could get the Rackspace cost down to about £300 pcm. However, their current process means you would have to actually delete your vms, rather than just power them off, in order to “stop the clock”, so to speak.

So basically we’re looking at £450 a month for this simple setup. Of course it’s a lot cheaper if you go for the very low spec vms, but these were the specs I needed at the time, even for a test environment.

The truth is, for such a small environment, I probably could have cobbled together a virtualised environment of my own using spare kit in the server room, which would have cost next to nothing.

So let’s look at a (very) slightly larger scale environment. The cost for an environment consisting of 8 Windows vms (with 1 SQL server) is around £1250 per month. After a year you would have spent £15k on cloud hosting!

But I can build my own vm farm with capacity for at least 50 vms for under £10k, so why would I choose to go with Rackspace or Amazon? Well, there are actually a few scenarios where AWS and Rackspace have come in useful:
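
The comparison above boils down to a simple break-even calculation, sketched here in Python. The two figures are the ones quoted in the text (£10k as the upper bound for the vm farm, £1250 a month for the 8-vm cloud environment). The function name is hypothetical, and the sketch deliberately ignores every running cost of a self-built farm such as power, maintenance and people.

```python
import math


def break_even_months(upfront_cost: float, cloud_monthly_cost: float) -> int:
    """First whole month at which cumulative cloud spend reaches the upfront cost."""
    return math.ceil(upfront_cost / cloud_monthly_cost)


if __name__ == "__main__":
    VM_FARM_GBP = 10_000        # "under £10k", taken as an upper bound
    CLOUD_MONTHLY_GBP = 1_250   # 8 Windows vms, figure quoted above
    months = break_even_months(VM_FARM_GBP, CLOUD_MONTHLY_GBP)
    print(f"Cloud spend matches the vm farm after {months} months")
```

In other words, on hardware cost alone the farm would have paid for itself in about eight months, with most of its 50-vm capacity still spare.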

1. When I just wanted a test environment up and running in no time at all – no need to deal with any ITOps team bottlenecks, just spin up a few vms and we’re away. In an ideal world, the infrastructure team should get a decent heads up when a new project is on its way, because the dev & QA team are going to need test environments setting up, and these things can sometimes take a while (more on that in a bit). But sadly, this isn’t an ideal world, and quite often the infrastructure team remain blissfully unaware of any hardware requirements until it’s blocking the whole project from moving forward. In this scenario, it has been convenient to spin up some vms on a hosted cloud and get the project unblocked, while we get on and build up the environments we should have been told about weeks ago (I’m not bitter, honestly :-))

2. Proof of concepting – Again, no need to go through any red tape; I can just get up and running on the cloud with minimal fuss.

3. When your test lab is down for maintenance/being rebuilt etc. If I could simply switch to a hosted cloud offering with minimal fuss, then I would have saved a LOT of downtime and emergencies in 2012. For example, at one company we hosted all our CI build servers on our own vm farm, and one day we lost the controller. We could have spun up another vm but for the fact that with one controller down, we were over capacity on the others. If I could have just spun up a copy of my Jenkins vm on AWS/Rackspace then I would have been back up and running in short order. Sadly, I didn’t have this option, and much panic ensued.

The Real Cost of Build-it-Yourself

So I’ve clearly been of the mind that hosting my own private cloud with a VMware vSphere setup is the most economically sensible solution. But is it really? What are the hidden costs?

Well last night, I was chatting with a couple of guys in the London Continuous Delivery community and they highlighted the following hidden costs of Build-it-Yourself (BIY):

Maintenance costs – With AWS they do the maintenance. Any hardware maintenance is done by them. In a BIY solution you have to spend the time and the money keeping the hardware ticking over.

Setup costs – Setting up a BIY solution can be costly. The upfront cost can be over £20,000 for a decent vm farm.

Management costs – The subsequent management costs can be very high for BIY systems. Who’s going to manage all those vms and all that hardware? You might (probably will) need to hire additional resources, that’s £40k gone!

So really, which solution is cheapest?
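
One way to explore that question is to compare cumulative spend month by month. The Python sketch below is only a rough model under stated assumptions: it takes the £20,000 setup cost and £1250 a month cloud figure from the text, treats the £40k for extra people as a yearly cost (my assumption), leaves hardware maintenance out because no figure is given, and guesses the cloud price for 50 vms by scaling the 8-vm figure linearly. The function and constant names are hypothetical.

```python
from typing import Optional


def first_month_biy_cheaper(
    biy_upfront: float,
    biy_monthly: float,
    cloud_monthly: float,
    horizon_months: int = 60,
) -> Optional[int]:
    """Return the first month in which cumulative BIY spend is below
    cumulative cloud spend, or None if that never happens in the horizon."""
    for month in range(1, horizon_months + 1):
        biy_total = biy_upfront + biy_monthly * month
        cloud_total = cloud_monthly * month
        if biy_total < cloud_total:
            return month
    return None


SETUP_GBP = 20_000
MANAGEMENT_MONTHLY_GBP = 40_000 / 12          # assumes the £40k is per year
CLOUD_8_VMS_GBP = 1_250                       # figure quoted earlier
CLOUD_50_VMS_GBP = CLOUD_8_VMS_GBP / 8 * 50   # assumption: linear scaling

if __name__ == "__main__":
    small = first_month_biy_cheaper(SETUP_GBP, MANAGEMENT_MONTHLY_GBP, CLOUD_8_VMS_GBP)
    large = first_month_biy_cheaper(SETUP_GBP, MANAGEMENT_MONTHLY_GBP, CLOUD_50_VMS_GBP)
    print("8 vms, BIY cheaper from month:", small)
    print("50 vms, BIY cheaper from month:", large)
```

On these assumptions the 8-vm environment never pays back a self-built farm within five years, while a fully used 50-vm farm does so in its fifth month. The honest answer seems to be that it depends on scale, and on whether you really need that extra hire.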

 

Test-Driven Infrastructure with Chef – Book Review

A while ago I ordered a copy of “Test-Driven Infrastructure with Chef” from Amazon. It must have been sometime last summer in fact. It took months to arrive, because they simply didn’t have enough copies. So when it finally did arrive, I was very excited to see if my wait was worth the, er, wait.

Written by Stephen Nelson-Smith (he of @LordCope twitter fame and author of the excellent Agile Sysadmin blog), Test-Driven Infrastructure with Chef is by no means “War And Peace”. In fact it’s pretty tiny, and looks more like a pamphlet than a book. But what it lacks in size it more than makes up for in concise content.

What I really like about this book is that it feels absolutely perfect for someone like me, and by that I guess I’m trying to say that the target audience is well thought out. It’s aimed at Developers, Ops Engineers, DevOps, Sysadmins and Release engineers, those sorts of people. It assumes you know a certain amount about your own business, and so I don’t find myself sitting there reading some really basic stuff that anyone in my line of work is bound to know already. I’ll take the Continuous Delivery book as an example – it’s a great book, but some of it is about introducing Continuous Integration! I would have thought that if you were about to embark on Continuous Delivery, the first thing you’d already be VERY comfortable with would be C.I. so why the need to cover that all over again? Besides, there are plenty of good C.I. books on that subject. Of course, I know that the Continuous Delivery book is in actual fact aimed at a much wider audience, but what I’m trying to say here is that Test-Driven Infrastructure with Chef is more targeted, and I feel it’s a much easier read for it.

Infrastructure as Code

The fundamental premise of this book, and indeed the main point of Chef itself, is that we should treat infrastructure as code. What this means is that managing, designing, deploying and testing infrastructure should be done in an analogous fashion to how we do these same things with software. The code that builds, deploys and tests our infrastructure should be committed to source control in the same way as the code that builds, deploys and tests our software is.

This approach brings with it many of the same principles as we have around building, deploying and testing our software, and these are listed in chapter 1. Also listed here are the advantages of treating your infrastructure as code, things such as repeatability, automation, agility, scalability, disaster recovery and very importantly, reassurance!
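
To make the idea concrete, here is a toy Python sketch of infrastructure described as data and applied by code. It is not Chef’s Ruby DSL and it is not taken from the book: the `Server` class, the `WEB_SERVER` definition and the `converge` function are all made up for illustration, and the “server” is just an in-memory object. What it tries to show is the repeatability point: the desired state lives in a file you can commit, and applying it twice changes nothing the second time.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Toy in-memory stand-in for a real machine (hypothetical)."""
    packages: set = field(default_factory=set)
    running_services: set = field(default_factory=set)


# The desired state is plain data, so it can sit in source control
# next to the application code.
WEB_SERVER = {
    "packages": {"apache2"},
    "services": {"apache2"},
}


def converge(server: Server, desired: dict) -> list:
    """Bring the server to the desired state and return the actions taken."""
    actions = []
    for pkg in sorted(desired["packages"] - server.packages):
        server.packages.add(pkg)
        actions.append(f"install {pkg}")
    for svc in sorted(desired["services"] - server.running_services):
        server.running_services.add(svc)
        actions.append(f"start {svc}")
    return actions


if __name__ == "__main__":
    box = Server()
    print(converge(box, WEB_SERVER))  # first run does the work
    print(converge(box, WEB_SERVER))  # second run finds nothing to change
```

Real tools do a great deal more than this, of course, but the shape is the same: describe what you want, let the tool work out what needs doing, and keep the description under version control.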

The book goes on to introduce the reader to Chef. Chef is an open source tool for managing, deploying and configuring infrastructure. It’s produced by Opscode – check out the website for more information. The book explains the Chef tool, framework and API, and then goes on to give instructions on how to install it (you’ll need Ruby installed – and the book covers this, using RVM). You’ll also get an introduction to Git (and GitHub) here, and how to install Git on Ubuntu. Incidentally, all the examples are based on an Ubuntu system, so if you follow the examples closely, it’s best to have Ubuntu, or an Ubuntu vm, at hand. That’s not to say that the examples can’t be done on other systems, but I would guess that CentOS would require a lot more behind-the-scenes configuring, thanks to its less-than-fantastic package repositories.

Cucumber-Chef

Chapter 4 provides a nice introduction and description of Test-Driven and Behavior-Driven Development, and talks a little about the Acceptance Test automation tool Cucumber, before chapter 5 goes into some more detail about Cucumber-Chef (don’t worry if you haven’t heard about this before, the book tells you all you need to know to get started, but for now let’s just say it does for infrastructure what Cucumber does for code).

I couldn't find the logo for cucumber-chef, so here's a picture of a cucumber...

...and chef

Chapter 5 introduces us to the Amazon EC2 Web Service. It shows you how to get set up with a personal account (which is nice and easy), because you’ll need it to work through the examples that follow! This is one of the things that I like most about this book: it’s a practical guide, and it’s as if the author knew (which he obviously did) that his target readers are the types of people who like to get stuck in and do stuff. Chapter 5 finishes off with instructions on installing Chef and using a couple of the built-in tasks.

Recipes, Roles and Cookbooks

It’s chapter 6 and time for a worked example using cucumber-chef. This is where we first meet Recipes, Roles and Cookbooks. First, we learn how to do Cucumber-style Given, When, Then Acceptance tests, and we start TDDing. This chapter really is about how to apply test-driven development to an operations solution. Requirements are gathered, Acceptance criteria are identified, Acceptance Tests are written (before we actually do any Chef scripting, as per TDD), and we follow the “Red, Green, Refactor” model until we have a working solution. In my copy there’s a glaring misprint on page 48, where the given, when, then example ends up being “given, when, when” 🙂 Hopefully this’ll get corrected in future editions.
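
For anyone who hasn’t seen the Given, When, Then rhythm applied to infrastructure, here is a toy sketch of the red-then-green part of the cycle. It is plain Python rather than the Cucumber feature files and Chef recipes the book actually uses, and everything in it is hypothetical: the dictionary standing in for a server, the `deploy` user, and the two “recipes”. The point is only the order of events, with the acceptance test written first and failing before any configuration code exists.

```python
def given_a_fresh_server() -> dict:
    """Given: a newly provisioned server with no users (toy stand-in)."""
    return {"users": set()}


def when_the_recipe_is_applied(server: dict, recipe) -> None:
    """When: the configuration code runs against the server."""
    recipe(server)


def then_the_user_exists(server: dict, name: str) -> bool:
    """Then: the requirement we care about is met."""
    return name in server["users"]


def acceptance_test(recipe) -> bool:
    """The acceptance test, written before any real recipe exists."""
    server = given_a_fresh_server()
    when_the_recipe_is_applied(server, recipe)
    return then_the_user_exists(server, "deploy")


def empty_recipe(server: dict) -> None:
    """Red: nothing has been implemented yet."""


def create_deploy_user(server: dict) -> None:
    """Green: the smallest change that makes the test pass."""
    server["users"].add("deploy")


if __name__ == "__main__":
    print("red:  ", acceptance_test(empty_recipe))
    print("green:", acceptance_test(create_deploy_user))
```

The first call reports a failing test and the second a passing one; refactoring then happens with the test as a safety net.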

The final chapter underlines how managing our infrastructure as code, and applying the principles of test-driven development can help us increase our quality and reduce the usual risk associated with deploying infrastructure.

 

Conclusion

This book will serve as a great introduction to Chef for anybody who learns best via hands-on examples. But it’s much more than an introduction to Chef, it’s really about the practice of test-driven development, and about how to apply the principles of TDD to infrastructure management.

Using Chef in itself is a step in the right direction – it allows us to treat infrastructure as code, and this has so many benefits – we can version our configurations much more easily, we can store our configurations in a proper source code management tool, we can deploy configurations sensibly, and so on. And applying TDD principles to your Chef “development” obviously looks like a great idea, it brings with it all the goodness of TDD, and gets us to think in terms of requirements and acceptance criteria before we start building our systems, ensuring that what we produce is fit for purpose.

Even if you don’t follow TDD, or don’t plan to follow TDD for your infrastructure development, this book is still well worth the read. The hands-on approach, using examples you can actually work with, is refreshing. I’ve probably learned more from working through this little book than I have from reading most other voluminous technical guides.