Infrastructure Automation and the Cloud
As I write this, I’m sitting in a half-empty office in London. It’s half empty, you see, because it’s snowing outside, and when it snows in London, chaos ensues. Public transport grinds to a complete halt, buses just stop, and the drivers head for the nearest pub/cafe. The underground system, which you would think would be largely unaffected by snow, what with it being under ground, simply stops running. The overground train service has enough trouble running when it‘s sunny, let alone when it’s snowing. And of course most people know this, so whenever there’s a risk of snow, many people simply stay at home, hence the half-empty office I find myself in.
But why does London grind to such a standstill? Many northern European cities, as well as American ones, experience far worse conditions and yet life still runs fairly normally. Well, one reason for London’s regular winter shutdown is the infrastructure (you can see where I’m going with this, right?). The infrastructure in London is old and creaking, and in desperate need of some improvement. The problem is, it’s very hard to improve the existing infrastructure without causing a large amount of disruption, thus causing a great deal of inconvenience for the people who need to use it. The same can often be said about improving IT infrastructure.
Last night I went along to one of the excellent London Continuous Delivery Meetups (organised by Matthew Skelton at thetrainline.com – follow him on twitter here) which this month was all about Infrastructure Automation using Chef. Andy from Opscode gave us a demo of how to use Chef as part of a continuous delivery pipeline, which automatically provisioned an AWS vm to deploy to for testing. It all sounded fantastic, it’s exactly what many people are doing these days, it uses all the best tools, techniques and ideas from the world of continuous delivery, and of course, it didn’t work. There was a problem with the AWS web interface so we couldn’t actually see what was going on. In fact it looked like it wasn’t working at all. Anyway, aside from that slight misfortune, it was all very good indeed. The only problem is that it’s all a bit utopian. It would be great if we could all work on greenfield projects, or start rewriting everything from scratch, but in the real world, we often have legacy systems (and politics) which represent big blockers on the path to getting to utopia. I compare this to the situation with London’s Infrastructure – it’s about as “legacy” as you can possibly get, and the politics involved with upgrading it is obvious every time you pick up a newspaper.
In my line of work I’ve often come across the situation where new infrastructure was required – new build environments, new test server, new production environments and disaster recovery. In some cases this has been greenfield, but in most cases it came with the additional baggage of an existing legacy system. I generally propose one or more of the following:
- Build a new system alongside the old one, test it, and then swap it over.
- Take the old system out of commission for a period of time, upgrade it, and put it back online.
- Live with the old system, and just implement a new system for all projects going forward.
Then comes the politics. Sometimes there are reasons (budget, for instance) that prevents us from building out our own new system alongside the old one, so we’re forced into option 2 (by far the least favorable option because it causes the most amount of disruption).
The biggest challenge is almost always the Infrastructure Automation. Not from a technical perspective, but from a political point of view. It’s widely regarded as perfectly sensible to automate builds and deployments of applications, but for some reason, manually building, deploying and managing infrastructure is still widely tolerated! The first step away from this is to convince “management” that Infrastructure Automation is a necessity:
- Explain that if you don’t allow devs to log on to the live server to change the app code, then why is it acceptable to allow ops to go onto servers and change settings?
- Highlight the risk of human error when manually configuring servers
- Do some timings – how long does it take to manually build your infrastructure – from provisioning to handover (including any wait times for approval etc)? Compare this to how quick an automated system would be.
Once you’ve managed to convince your business that Infrastructure Automation is not just sensible, but a must-have, then it’s time for the easy part – actually doing it. As Andy was able to demonstrate (eventually), it’s all pretty straightforward.
Recently I’ve been using the cloud offerings from Amazon as a sort of stop-gap – moving the legacy systems to AWS, upgrading the original infrastructure by implementing continuous delivery and automating the infrastructure, and then moving the system back onto the upgraded (now fully automated and virtualised) system. This solution seems to fit a lot more comfortably with management who feel they’ve already spent enough of their budget on hardware and environments, and are loath to see the existing system go to waste (no matter how useless it is). By temporarily moving to AWS, upgrading the old kit and processes, and then swapping back, we’re ticking most people’s boxes and keeping everyone happy.
Cloud Hosting vs Build-it-Yourself
Cloud hosting solutions such as those offered by Amazon, Rackspace and Azure have certainly grown in popularity over the last few years, and in 2012 I saw more companies using AWS than I had ever seen before. What’s interesting for me is the way that people are using cloud hosting solutions: I am quite surprised to see so many companies totally outsourcing their test and production environments to the cloud, here’s why:
I’ve looked into the cost of creating “permanent” test labs in the cloud (with AWS and Rackspace) and the figures simply don’t add up for me. Building my own vm farm seems to make far more sense both practically and economically. Here are some figures:
3 Windows vms (2 webservers, 1 SQL server) minimum spec of dual core 4Gb RAM:
- 2x Windows “Large” instance
- 1x Windows “large” instance with SQL server
- Total: £432 ($693.20)
- 3x 4Gb dual core = £455
- 1x SQL Server = £o
- Total: £455
These figures assume a full 730 hours of service a month. With some very smart time and vm management you could get the rackspace cost down to about £300 pcm. However, their current process means you would have to actually delete your vms, rather than just power them off, in order to “stop the clock” so to speak.
So basically we’re looking at £450 a month for this simple setup. Of course it’s a lot cheaper if you go for the very low spec vms, but these were the specs I needed at the time, even for a test environment.
The truth is, for such a small environment, I probably could have cobbled together a virtualised environment of my own using spare kit in the server room, which would have cost next to nothing.
So lets look at a (very) slightly larger scale environment. The cost for an environment consisting of 8 Windows vms (with 1 SQL server) is around £1250 per month. After a year you would have spent £15k on cloud hosting!
But I can build my own vm farm with capacity for at least 50 vms for under £10k, so why would I choose to go with Rackspace or Amazon? Well, there are actually a few scenarios where AWS and Rackspace have come in useful:
1. When I just wanted a test environment up and running in no time at all – no need to deal with any ITOps team bottlenecks, just spin up a few vms and we’re away. In an ideal world, the infrastructure team should get a decent heads up when a new project is on it’s way, because the dev & QA team are going to need test environments setting up, and these things can sometimes take a while (more on that in a bit). But sadly, this isn’t an ideal world, and quite often the infrastructure team remain blissfully unaware of any hardware requirements until it’s blocking the whole project from moving forward. In this scenario, it has been convenient to spin up some vms on a hosted cloud and get the project unblocked, while we get on and build up the environments we should have been told about weeks ago (I’m not bitter, honestly :-))
2. Proof of concepting – Again no need to go through any red-tape, I can just get up and running on the cloud with minimal fuss.
3. When your test lab is down for maintenance/being rebuilt etc. If I could simply switch to a hosted cloud offering with minimal fuss, then I would have saved a LOT of downtime and emergencies in 2012. For example, at one company we hosted all our CI build servers on our own vm farm, and one day we lost the controller. We could have spun up another vm but for the fact that with one controller down, we were over capacity on the others. If I could have just spun up a copy of my Jenkins vm on AWS/Rackspace then I would have been back up and running in short order. Sadly, I didn’t have this option, and much panic ensued.
The Real Cost of Build-it-Yourself
So I’ve clearly been of the mind that hosting my own private cloud with a VMware VSphere setup is the most economically sensible solution. But is it really? What are the hidden costs?
Well last night, I was chatting with a couple of guys in the London Continuous Delivery community and they highlighted the following hidden costs of Build-it-Yourself (BIY):
Maintenance costs – With AWS they do the maintenance. Any hardware maintenance is done by them. In a BIY solution you have to spend the time and the money keeping the hardware ticking over.
Setup costs – Setting up a BIY solution can be costly. The upfront cost can be over £20,000 for a decent vm farm.
Management costs – The subsequent management costs can be very high for BIY systems. Who’s going to manage all those vms and all that hardware? You might (probably will) need to hire additional resources, that’s £40k gone!
So really, which solution is cheapest?