Should “DevOps” be a job title? Are DevOps teams an anti-pattern? Can you do DevOps within a single team? Were the moon landings staged in the Arizona desert? These are the sort of questions you’re never more than 5 feet away from at any DevOps conference or meetup. The answers, of course, are:
- Yes and no
- Don’t be silly
Everyone seems to have an opinion on the whole “DevOps Team/Job Title” question, and I’m no different, except my opinion is this:
I passionately don’t care
Let me explain:
If you have a devops team, and their job is to encourage and develop a devops culture within the organisation, then that’s perfectly fine, surely? If that team is successful and a devops culture blossoms, then you’re definitely winning. Maybe that team could subsequently change its name, but that just seems a bit pointless and overly hung-up on semantics. In this case, I don’t care what the team is called, because it doesn’t matter.
If you have a devops team, and they don’t encourage or foster a devops culture within the organisation, then you’re doing devops wrong, as simple as that. If you’re doing devops wrong then it doesn’t matter what you call a team, you’re still doing it wrong. In this case, I don’t care what the team is called, because it doesn’t matter.
I do understand the opinions of the anti-pattern crowd. Setting up a separate silo responsible for “doing devops” is completely wrong. It discourages other teams and individuals from adopting a devops attitude, as they’ll see it as someone else’s responsibility. But the problem isn’t with the existence of the team, the problem is the purpose of the team and the fact that whoever decided the team should “do DevOps” clearly doesn’t understand what DevOps is.
At the very core of the issue is an understanding of what DevOps actually is. If your CTO, (or Head of Technology or whoever calls the shots within the technology division) thinks that DevOps just means automating builds and deployments, treating Infrastructure as Code, or adopting Continuous Delivery, etc then you’ve already got a problem. DevOps is about these things (and more), sure, but it’s about everyone understanding the importance of them, and absorbing them into their culture.
I’ve previously posted about the dangers of having an obsession with Ownership and Responsibility and I think these factors can also contribute to a failed adoption of devops. Drawing up clear lines of ownership and responsibility is risky – if you get it wrong, you’re going to struggle. For instance, if you draw a clear line of ownership around “devops” and place it firmly in the domain of the devops team, then you’re not going to do devops. A single team cannot own and be responsible for a culture, it just doesn’t work like that. The lines of ownership need to be blurred or wiped out entirely. Nobody and everybody should be “responsible for” and “own” the devops culture.
Conclusions On The DevOps Team
Anyone who thinks that you can get something ingrained into an organisation’s culture by setting up an isolated group of people who are solely responsible for doing those things, is flying in the face of conventional wisdom. Even my experience suggests that this doesn’t work (and I always fly in the face of conventional wisdom, coz I’m stooopid). By all means create a DevOps Team, but don’t make them responsible for “doing devops”, that’s just wrong. Instead, give them the challenge of spreading the DevOps gospel, evangelising DevOps within the organisation and training everyone on the benefits of devops, as well as some of the tricks of the trade.
On The DevOps Job Title
Everyone’s a devops engineer these days. I’m a devops engineer, my wife’s a devops engineer, even my dog’s a devops engineer. It’s a booming industry and everyone wants a piece of it. I’m afraid we don’t have any control over this anymore, we’ve created a beast and it’s consuming everything!! Luckily it’s not the end of the world. Sure, sysadmins the world over are now rebranding themselves as devops engineers, but does it make any difference at the end of the day? If you’re hiring, you don’t just hire someone on the strength of their previous job title do you? No, you actually read their CV and interview the candidate. Good candidates will always shine through. Working out if someone’s a sysadmin or a build engineer is just a bit more hassle for recruitment agents, that’s all!
In a way I’m thankful for the devops job title. I honestly think it has helped to make the whole devops thing more popular and opened it up to a wider audience.
Should there really be such a thing as a “DevOps Engineer”? Probably not, but we’re far too late to stop it, and trying to stop it seems a bit of a waste of energy to me. Eventually “DevOps Engineer” will come to mean something more specific, but for now we’re just going to have to read a few more lines on CVs.
SAFe (the Scaled Agile Framework, with a random “e” at the end) seems to be the talk of the agile world at the moment, and as you’d expect, opinions are both strong and divided. On the one side you’ve got the likes of Ken Schwaber and David Anderson (the guys who brought us Scrum and Kanban respectively), while on the other side you’ve got, well, pretty much anyone who stands to make any money out of SAFe.
I’m working at a company in London who might be about to go down the SAFe route, so obviously I’ve had to do a bit of research into this new framework, as it could soon be directly impacting me in my capacity as an agile coach and devops ninja.
My initial questions were:
- What the hell is this and how come I’ve never heard of it?
- Why is there no mention of devops?
- Is it really as prescriptive as it sounds?
- Isn’t it a bit “anti-agile”? (recall “individuals and interactions over processes and tools”, part of the agile manifesto)
So I started looking into SAFe. There’s quite a bit of information on the SAFe website, and the “Big Picture” on the homepage is crammed with information, jargon and stuff. At first glance it seemed fairly sensible on the higher levels, but overly prescriptive on the team level. Then you start clicking on icons, and that’s when it gets interesting…
Story Sizing, Velocity and Estimating: They tell us that all teams should have the same velocity and the same sizing (i.e. 1 point in team A should be exactly the same as 1 point in team B). They also say that 1 point should equal 1 day. I find this quite interesting, as this time-based evaluation is flawed for a number of reasons. Velocity (in normal scrum) should be obtained by measurement, not by simply saying “right, there’s 5 people in the team, 5×8=40, therefore our velocity will be 40!” This pains me, because I firmly believe that each team should be allowed to work out their own sustainable velocity, based on observation and results (and applying the Deming cycle of making a change and seeing if it improves output). If we are all given a goal of x points to achieve in a sprint, all we will do as a team is fiddle our estimations so that we hit that target. That’s exactly why we don’t use velocity as a target. What am I missing here?
ScrumXP: SAFe seems to suggest that you MUST do 2 week sprints. You have no option in this. Doesn’t seem to matter if you want to have a kanban system based around a weekly release schedule. SAFe seems largely ignorant of this. Is SAFe suggesting that Kanban doesn’t work? Has anyone told David Anderson?
SAFe prescribes Hardening Sprints: These are sprints set aside at the end of every release (one every 5th Sprint), to allow you to do such things as User Acceptance Testing and Performance Testing. In Continuous Delivery we work towards making these activities happen as early as possible in the release pipeline, in order to shorten the feedback loop. We really don’t want to be finding out that our product isn’t performant a day before we expect to release it! I certainly wouldn’t encourage the use of hardening sprints in the SAFe way, instead I would encourage people to build these activities into their pipelines as early as possible. I think of hardening Sprints as a bad smell, isn’t it just a way of confessing that you don’t expect to catch certain things until the end? So rather than try to fix that situation and reduce that feedback loop, you’re kind of just saying “hey, s**t happens, we’ll catch it in the hardening sprint”.
Innovation Sprints: These happen at the same time as the hardening sprint. SAFe is suggesting that during a normal sprint we don’t have time for innovation. And that is quite often the case – but wouldn’t it be better if we actually did have sufficient time for continuous innovation, rather than actually have a dedicated half-sprint for innovation? The book “Slack” by Tom DeMarco talks about the myth of total efficiency, and suggests that by slowing down and building in some slack time, we get greater returns. This is better achieved as part of everyday practice rather than working at some mythical “total efficiency” level and then having an “innovation sprint”. The SAFe approach seems to be an easy option. Rather than taking the time to determine a team’s sustainable velocity which includes sufficient time for innovation, it suggests just saving it up for a sprint at the end of every release. Don’t forget that at the same point, the team will apparently be doing “hardening” activities, gearing up for a release, and planning the next one. For some reason I feel uncomfortable with the idea that innovation is something that should be scheduled once every 10 weeks, rather than something that should be encouraged and nurtured as part of normal practice.
The Scrum Master: SAFe has this to say about the Scrum Master:
In SAFe, we assume that the Scrum Master is also a developer, tester, project manager or other skilled individual (though not the team’s manager) who fulfills his Scrum Master role on a part time basis.
Wow, that’s some assumption. They seem to suggest that you can just take any developer, tester etc and send them on a scrum course, and hey presto, you have a scrum master. And yes, you could do this, but what sort of scrum master are you getting? They also say:
responsibilities can generally be accomplished in about 25% of the time for that team member
which I again find surprising. A Scrum Master is just one quarter of a person’s time?? Seriously? Mentoring a team, coaching individuals, removing impediments, applying the principles of Scrum, helping the team work towards a goal, leading a team towards continuous improvement – all of these things are expected of the Scrum Master in SAFe, and yet they can all be achieved in “about 25%” of a person’s time, apparently. And where does an agile coach come into this? Well, they don’t exist in SAFe. In SAFe you have SAFe consultants instead.
The Product Manager and The product Owner: These are 2 very separate, very different roles in SAFe. A Product Owner works with the Scrum Team, but doesn’t have contact with the customer. The Product Manager has contact with the customer but deals with the scrum team through the Product Owner. Also the product owner doesn’t own the product vision – that responsibility belongs to the product manager – this seems strange to me, I would have naturally thought that the product “owner” would own the product vision. So essentially we’re adding yet another link in the chain between the customer and the team. I’m struggling to see this as a good thing, when in my experience a close relationship between the business and the team has always been of great benefit.
There is no Business Analyst role in SAFe, which I find quite interesting. This role seems to have been split out into the Product Owner and Product manager roles. For instance, the PO is meant to do the Just-In-Time analysis on the backlog stories.
in SAFe, the UX Designer is NOT part of the agile team. Rather, they work “at the Program Level” (whatever the hell that means, possibly on a different floor, maybe) yet they still do the following:
- Provide Agile Teams with the next increment of UI design, UX guidelines, and design elements in a just-in-time fashion
- Attend sprint planning, backlog grooming, iteration demos
- Work in an extremely incremental way
- Rely on fast and frequent feedback via rapid code implementation
- Are highly collaborative, and…
- The UI criteria are included in the “Definition of Done” and User Story acceptance criteria
But I must remind you that according to SAFe they are NOT part of the agile team :-) Is it just me or does this come across as a bit, I don’t know, pedantic?
Pretty much the same rule applies to devops (which was included in a later version of SAFe) – devops people aren’t in the team BUT, you can simply achieve “devops” in part by:
integrating personnel from the operations team with the Agile teams
Er, ok. So they’re not part of the team but they’re integrated with the team. Riiiiight. On a plus note – it does mention “designing for deployability”, which can never be overstated in my opinion.
These are just my initial observations and I’m sure I’ll have a lot more to say on the subject as we embark on our SAFe journey. I’m hoping it’s not as prescriptive as it sounds, as I honestly don’t believe there’s a one-size-fits-all solution to adopting Agile. I very much believe that every organisation needs to go on their own journey with agile, and find out what works best for them. It’s my opinion that the lessons you learn on this journey are more important than the end result. In my experience, most organisations will have invariably witnessed a fantastic cultural shift during their gradual transition to agile, and I find it very difficult to see how a prescribed framework such as SAFe can facilitate this cultural shift.
The Agile Silver Bullet
So is SAFe really an agile silver bullet? I doubt it, but time will tell. I certainly don’t disagree with the majority of the contents of the “Big Picture” but where I do disagree, I feel very concerned, as I seem to disagree on a very fundamental level.
I would be much happier if SAFe was a lot less prescriptive-sounding. I can see SAFe being popular with larger-scale organisations with a penchant for job-titles and an unhealthy affinity for bureaucracy, I mean, it’s a framework, and they lap that stuff up! I can also see it being quite effective in those situations, after all, pretty much anything’s better than Waterfall!
I can see SAFe appealing to people who aren’t prepared to go on the agile journey, because they fear it. They fear they will fail, and they fear a lack of clarity. This framework puts nice titles everywhere, tickboxes to be ticked and nice clear processes to blindly follow. I can imagine it would be hard not to look like you’re making progress! I don’t yet trust the framework, but that could still change, but for the time being I’ve got the impression that it’s command-and-control agile, more of a tick-box exercise than a vessel for personal and organisational development.
If Devopsdays London was an office party, Patrick Dubois would be the Head of HR keeping an eye on proceedings, the guest speakers would be the live entertainment, and the Zero Turnaround guys would lacing the punch with more vodka and photocopying their body parts.
Zero Turnaround bought the fun, as evidenced by their stand which looked more like a booze section from an off licence, and this video here, which had me giggling in the back row like a naughty school kid:
Watch out Grammy Awards. Naturally, I felt I had to go and have a chat with these guys during one of the breaks and find out if I could blag any of their free booze…
“It’s all been taken, by competition winners” explained Simon from behind a massive heap of Guinness and Irish Whiskey. “We’re just, er, looking after this lot” said Neeme (Simon’s Zero Turnaround partner in crime) Yeah, sure.
No luck with the free booze, I thought maybe I could distract them with a question or two about devops, and then pinch some beer while they were deep in thought.
“So what does DevOps mean to you guys?” I asked, with half an eye on a bottle of Jamesons…
“A lot of it is about streamlining and bringing value to the customers sooner” Said Simon Maple, Tech Evangelist at ZT. And I suppose that’s actually what it’s ultimately about. Bridging the Dev and Ops gap is really about making us work as a team more smoothly, which leads to us being able to get stuff to our customers more quickly and more reliably. “But of course you need to right tools to do it – you have to automate across Dev and Ops” He added “This helps us remove bottlenecks”, which reminded me about my bottle of whiskey objective, so I asked another question by means of a diversion…
“Do you see a big crossover between DevOps and Agile” I asked. “Yes, DevOps is sort of like Agile breaking out of Dev and into Ops, but you can do devops stuff without being Agile” which again is true, but I imagine if you’re truly agile as a business, you’re almost certainly going to be doing “devops”.
Next, I decided to chuck in a question that caused a whole heap of opinionated discussion at devopsdays, namely “Is DevOps a job title?” (correct answer of course is – “what does it matter what your job title is? You can call yourself the Prince of Darkness for all I care. What matters is getting stuff done”), but Simon went for “I don’t think so, is Agile a job title?” which also made a lot of sense.
“So” I asked, “what do you see as the main challenges to DevOps?” I thought this questions was sufficiently fluffy to warrant an equally fluffy response, during which I could maybe accidentally knock one of the cans of Guinness into my bag…
“Well DevOps has well and truly gone viral now, so our main challenge is to make sure we do it right, or it’ll get a bad name” said Simon, as if he’d heard that particular question a million times before. No luck with the Guinness there then.
After about 5 minutes of chatting I was no closer to bagging myself any free beer than I was when I turned up, but I was having a good chat, and eventually we got around to talking about their products, JRebel and (the one I was more interested in) Live Rebel. I’d already seen their demo and it looked pretty cool to me. I’ll trial JRebel soon and put a review up here too, as I think people will be pretty impressed (I know I was), but that’s for another time. If you want to know more about their products, go see their website, duuh! For those too lazy to go to google and type in “Zero Turnaround”, here’s their website: http://zeroturnaround.com
For those even lazier than that, here’s a REALLY brief summary:
Live Rebel is a deployment orchestrator. It rolls up code and db changes into a release package and automates the delivery to whatever environment you like. It supports Java, Ruby, Python, PHP and Perl (officially) and Neeme also says it’ll work with .Net as well (not so officially). It also plugs nicely into most decent CI systems, such as Jenkins/Hudson and Bamboo (for which there are plugins) and has a one-stop management console for managing your servers and environments etc and so forth.
Anyway, I still had one final question for Simon and Neeme… “What has been your highlight of devopsdays then guys?” I asked. “Meeting the real creme de la creme of the industry and just being part of a great conference” came the reply. So, not meeting me then. Thanks guys. :-(
Simon Maple is @sjmaple on twitter and Neeme is @nemecec. Follow them at your own peril.
By James Betteley
No time for that, I’m seriously late. I was leaving the office just as someone said “James, before you go…” and that was the end of any hopes I had of getting here on time.
It’s 6:40pm, I’ve finally got my sh1t together, and we’re off! In act that’s a lie, they were off ages ago. I’m so late there aren’t even any seats, so I’m sat in the corner at the back, like a proper Billy No-mates
6:41pm: Chris is talking about big balls of mud and how they’ve gone from that, to a much smaller ball (no mention of mud this time).
6:42: Slides are going by quicker than I can type! Chris is talking about the importance of moving away from a “blame culture”. I personally hate blame culture, I think it was the French who invented it. Arf! (sorry).
6:45pm: Ok, there’s a slide on how to get to Continuous Delivery, I’m going to pay attention to this one…
It says you need:
- Cross functional product focused teams
- A focus on technical debt
- Sit the team close to their clients
- actively remove blame culture
- focus on self improvement
- radiate metrics
- collect metrics on work in progress
After a quick coffee break it’s time to interrogate the suspects – It’s Q&A time!
7:01pm: How do you share “commonality of functionality”? Asks someone who likes words ending in “ality”. Service oriented architecture was apparently a big help responds someone from the 7digital posse (they are a posse by the way Chris has been joined by some 7digital reinforcements).
7.02pm: The next question is about metrics and how they collect them. Apparently they’re working on a logging system, but I’m guessing they also use CI and some live reporting tools which I probably missed in the earlier slides! Oops.
7.05pm: “What didn’t work?” Asks someone (my favorite question so far. I’m giving it a 7.5 out of ten). Trying to patch things up didn’t work. The dependency chain caused a pain. Acceptance Tests were a pain (haha!) Using UI stuff and a shared DB caused Acceptance Test issues, they say. I nod in agreement.
7:07pm: Someone says something about keeping environments the same being a challenge. They’re meant to be the same??? Where’s the fun in that?
7.10pm: A question on blame culture is next up, namely “How do you get rid of it?” By not telling on people! Also, having a dev manager who protects from above is handy.
It’s how you respond [to blame] that’s important, as that’s what sets the tone.
7:11pm: How did you make the culture change? Asks someone who wants to know how they made the culture change. It’s another good question, and one I’d really like to learn from. It’s all very well having a great culture, and there’s no denying its importance, but how do you make a culture change if it’s not ideal to start with? Sadly the answer isn’t straightforward. The posse reply with things like “Adopt Agile principles”, “tech manifesto” (which sounds cool), “self-organising teams”, “small steps” and “leading by example”. Followed up with “hire well”, “you need champions!” “do workshops”. Also, “the CTO is pretty cool”. Hmmmm, so no “click here to change your culture” button then?
7:16pm: The next question is a corker. It went a bit like: “Usually have to change architecture of system to support Continuous Delivery, but also sometimes the architecture of organisation as well. Did this happen?” That’s the winner so far. “No” comes the answer. Damn. At 7 digital there’s a focus on lack of hierarchy, so quite a flat structure. Not much change was needed then, obviously. I think the word “culture” came up as well, and not for the first time.
7:18pm: How have they managed to integrate Ops, asks the next person, clearly fresh from devopsdays. “We’re still learning” is the honest sounding response. Not that the all haven’t been honest sounding. They’ve started assigning Ops people to “the team”, by which I assume they mean the project team.
7:20pm: Someone wants to know if they had a shared goal between tech/ops and dev? I think the answer is yes. Basically Rob (who is the head of both ops and dev) became head of both ops and dev, which helped. He also created a tech manifesto and is toying with the idea of putting up some posters. When I was a kid I had a poster of Airwolf in my bedroom. Not sure if that’s going to help anyone though.
7:21pm: “Is there a QA on the team?” is the next question. Yarp (I’m paraphrasing). But the QA person is more of a coach – everyone is expected to do it, but they’re there to lead. No separate dev manger or QA manager – everyone’s one great big team (aaahhhh).
7:23pm: Somebody has asked how they handle support, and whether there’s a support team. I think the person who responds says there’s a “Systems team”, who get a text or call in the middle of the night. It seems a bit cruel that they wait until the middle of the night to text them but what do I know? Apparently the devs may also get involved, so that’s ok. There’s an on-call team, “but this is an area for improvement” they confess. Mainly it’s a case of “call someone!”, which I personally think is pretty good. But they do stress how there’s a focus on monitoring so that they can catch as many issues as possible before they become, er, issues, if you catch my drift. They said it much better than I can write it.
7:26pm: “How frequently did you deploy your big ball of mud compared to how frequently you do it now?” And that question goes to contestant number 2. It used to be one every 3 months, but they don’t measure how frequently they can deploy stuff any more because it’s that frequent.
(that’s just showing off). Improving the deploy mechanism was all-important. And changing the culture to shift to more frequent releases. That word “culture” again.
7:30pm: This question sounds like a plant: “How do you have time to test stuff if you deploy so often?” asks some cheeky 7digital employee hidden in the audience. I’m joking of course, it’s a nice question because it leads to a well executed answer: Chris basically explains that because they deploy so often, their releases are very small. Also, they’ve automated the hell out of everything.
7:31pm: Dave asks a really good question but I’m far too slow to keep up! It included the phrase “separation of concerns” so was probably too complicated for me to understand anyway.
7:40pm: There’s a question about schema changes. I reckon the answer will include the word “culture”. it does. Somehow.
If there was a word cloud for this Q&A session then “culture” would dwarf all the others. Something tells me that “culture shift” is important.
7:45pm: “How do you manage project accounting?” “We don’t” – No mention of culture!
7:46pm: Someone asks “If there’s a production issue, like an outage, who takes ownership?”. Nice one, who indeed does take ownership? “Everyone, we have a culture of shared ownership”. Gah! it’s all about culture!
7:47pm: “How do you decide what projects get green lighted?” asks some poor innocent from the back of the room (and no, it wasn’t me). Apparently this has nothing to do with Continuous Delivery (and all the other questions have?) and there’s nobody from the product team here so that question lands on stony ground.
7:52pm: Banos treads a fine line by asking a question dangerously close to the time when we’re meant to be heading to the pub, but just about gets away with it “what CI system do you use?” he asks, and the answer is (drum roll…..) Team City! Actually there was no drum roll, I made that up. Then interestingly they say that everyone is in charge of looking after Team city and that they just trust each other! Crazyness!
8.02pm: “I was told we finish at 8” says Chris, and she’s bloody well right, there’s a pub nearby and some of us are thirsty.
So, in conclusion, 7digital know their Continuous Delivery from their TDD, and “culture” is the word of the evening. I’m off for beer with Banos, and the rest of the London Continuous Delivery gang!
Keep an eye out for #londoncd on twitter for news of the next London Continuous Delivery meetup, or go to the London CD website. Also follow the likes of @matthewpskelton, @AgileSteveSmith, @banoss and @davenolan for more Continuous Delivery goodness.
As I write this, I’m sitting in a half-empty office in London. It’s half empty, you see, because it’s snowing outside, and when it snows in London, chaos ensues. Public transport grinds to a complete halt, buses just stop, and the drivers head for the nearest pub/cafe. The underground system, which you would think would be largely unaffected by snow, what with it being under ground, simply stops running. The overground train service has enough trouble running when it‘s sunny, let alone when it’s snowing. And of course most people know this, so whenever there’s a risk of snow, many people simply stay at home, hence the half-empty office I find myself in.
But why does London grind to such a standstill? Many northern European cities, as well as American ones, experience far worse conditions and yet life still runs fairly normally. Well, one reason for London’s regular winter shutdown is the infrastructure (you can see where I’m going with this, right?). The infrastructure in London is old and creaking, and in desperate need of some improvement. The problem is, it’s very hard to improve the existing infrastructure without causing a large amount of disruption, thus causing a great deal of inconvenience for the people who need to use it. The same can often be said about improving IT infrastructure.
Last night I went along to one of the excellent London Continuous Delivery Meetups (organised by Matthew Skelton at thetrainline.com – follow him on twitter here) which this month was all about Infrastructure Automation using Chef. Andy from Opscode gave us a demo of how to use Chef as part of a continuous delivery pipeline, which automatically provisioned an AWS vm to deploy to for testing. It all sounded fantastic, it’s exactly what many people are doing these days, it uses all the best tools, techniques and ideas from the world of continuous delivery, and of course, it didn’t work. There was a problem with the AWS web interface so we couldn’t actually see what was going on. In fact it looked like it wasn’t working at all. Anyway, aside from that slight misfortune, it was all very good indeed. The only problem is that it’s all a bit utopian. It would be great if we could all work on greenfield projects, or start rewriting everything from scratch, but in the real world, we often have legacy systems (and politics) which represent big blockers on the path to getting to utopia. I compare this to the situation with London’s Infrastructure – it’s about as “legacy” as you can possibly get, and the politics involved with upgrading it is obvious every time you pick up a newspaper.
In my line of work I’ve often come across the situation where new infrastructure was required – new build environments, new test server, new production environments and disaster recovery. In some cases this has been greenfield, but in most cases it came with the additional baggage of an existing legacy system. I generally propose one or more of the following:
- Build a new system alongside the old one, test it, and then swap it over.
- Take the old system out of commission for a period of time, upgrade it, and put it back online.
- Live with the old system, and just implement a new system for all projects going forward.
Then comes the politics. Sometimes there are reasons (budget, for instance) that prevents us from building out our own new system alongside the old one, so we’re forced into option 2 (by far the least favorable option because it causes the most amount of disruption).
The biggest challenge is almost always the Infrastructure Automation. Not from a technical perspective, but from a political point of view. It’s widely regarded as perfectly sensible to automate builds and deployments of applications, but for some reason, manually building, deploying and managing infrastructure is still widely tolerated! The first step away from this is to convince “management” that Infrastructure Automation is a necessity:
- Explain that if you don’t allow devs to log on to the live server to change the app code, then why is it acceptable to allow ops to go onto servers and change settings?
- Highlight the risk of human error when manually configuring servers
- Do some timings – how long does it take to manually build your infrastructure – from provisioning to handover (including any wait times for approval etc)? Compare this to how quick an automated system would be.
Once you’ve managed to convince your business that Infrastructure Automation is not just sensible, but a must-have, then it’s time for the easy part – actually doing it. As Andy was able to demonstrate (eventually), it’s all pretty straightforward.
Recently I’ve been using the cloud offerings from Amazon as a sort of stop-gap – moving the legacy systems to AWS, upgrading the original infrastructure by implementing continuous delivery and automating the infrastructure, and then moving the system back onto the upgraded (now fully automated and virtualised) system. This solution seems to fit a lot more comfortably with management who feel they’ve already spent enough of their budget on hardware and environments, and are loath to see the existing system go to waste (no matter how useless it is). By temporarily moving to AWS, upgrading the old kit and processes, and then swapping back, we’re ticking most people’s boxes and keeping everyone happy.
Cloud hosting solutions such as those offered by Amazon, Rackspace and Azure have certainly grown in popularity over the last few years, and in 2012 I saw more companies using AWS than I had ever seen before. What’s interesting for me is the way that people are using cloud hosting solutions: I am quite surprised to see so many companies totally outsourcing their test and production environments to the cloud, here’s why:
I’ve looked into the cost of creating “permanent” test labs in the cloud (with AWS and Rackspace) and the figures simply don’t add up for me. Building my own vm farm seems to make far more sense both practically and economically. Here are some figures:
3 Windows vms (2 webservers, 1 SQL server) minimum spec of dual core 4Gb RAM:
- 2x Windows “Large” instance
- 1x Windows “large” instance with SQL server
- Total: £432 ($693.20)
- 3x 4Gb dual core = £455
- 1x SQL Server = £o
- Total: £455
These figures assume a full 730 hours of service a month. With some very smart time and vm management you could get the rackspace cost down to about £300 pcm. However, their current process means you would have to actually delete your vms, rather than just power them off, in order to “stop the clock” so to speak.
So basically we’re looking at £450 a month for this simple setup. Of course it’s a lot cheaper if you go for the very low spec vms, but these were the specs I needed at the time, even for a test environment.
The truth is, for such a small environment, I probably could have cobbled together a virtualised environment of my own using spare kit in the server room, which would have cost next to nothing.
So lets look at a (very) slightly larger scale environment. The cost for an environment consisting of 8 Windows vms (with 1 SQL server) is around £1250 per month. After a year you would have spent £15k on cloud hosting!
But I can build my own vm farm with capacity for at least 50 vms for under £10k, so why would I choose to go with Rackspace or Amazon? Well, there are actually a few scenarios where AWS and Rackspace have come in useful:
1. When I just wanted a test environment up and running in no time at all – no need to deal with any ITOps team bottlenecks, just spin up a few vms and we’re away. In an ideal world, the infrastructure team should get a decent heads up when a new project is on it’s way, because the dev & QA team are going to need test environments setting up, and these things can sometimes take a while (more on that in a bit). But sadly, this isn’t an ideal world, and quite often the infrastructure team remain blissfully unaware of any hardware requirements until it’s blocking the whole project from moving forward. In this scenario, it has been convenient to spin up some vms on a hosted cloud and get the project unblocked, while we get on and build up the environments we should have been told about weeks ago (I’m not bitter, honestly :-))
2. Proof of concepting – Again no need to go through any red-tape, I can just get up and running on the cloud with minimal fuss.
3. When your test lab is down for maintenance/being rebuilt etc. If I could simply switch to a hosted cloud offering with minimal fuss, then I would have saved a LOT of downtime and emergencies in 2012. For example, at one company we hosted all our CI build servers on our own vm farm, and one day we lost the controller. We could have spun up another vm but for the fact that with one controller down, we were over capacity on the others. If I could have just spun up a copy of my Jenkins vm on AWS/Rackspace then I would have been back up and running in short order. Sadly, I didn’t have this option, and much panic ensued.
The Real Cost of Build-it-Yourself
So I’ve clearly been of the mind that hosting my own private cloud with a VMware VSphere setup is the most economically sensible solution. But is it really? What are the hidden costs?
Well last night, I was chatting with a couple of guys in the London Continuous Delivery community and they highlighted the following hidden costs of Build-it-Yourself (BIY):
Maintenance costs – With AWS they do the maintenance. Any hardware maintenance is done by them. In a BIY solution you have to spend the time and the money keeping the hardware ticking over.
Setup costs – Setting up a BIY solution can be costly. The upfront cost can be over £20,000 for a decent vm farm.
Management costs – The subsequent management costs can be very high for BIY systems. Who’s going to manage all those vms and all that hardware? You might (probably will) need to hire additional resources, that’s £40k gone!
So really, which solution is cheapest?
Last night I was taken to the London premier of Warren Miller’s latest film called “Flow State”. Free beers, a free goodie bag and an hour and a half of the best snowboarders and skiers in the world doing tricks which in my head I can also do, but in reality are about 10000000% better than anything I can manage. Good times. Anyway, on my way home I was thinking about “flow” and how it applies to DevOps. It’s a tricky thing to maintain in an Ops capacity. It reminded me of a talk I went to last week where the speakers talked of the importance of “Flow” in their project, and it’s inspired me to write it up:
Thoughtworkers Pat and Aleksander have been working at a top secret location* for a top secret company** on a top secret mission to implement continuous delivery in a corporate Windows world***
* Ok, I actually forgot to ask where it was located
** It’s probably not a secret, but they weren’t telling me
*** It’s obviously not a secret mission seeing as how: a) it was the title of their talk and b) I just said what it was
Pat and Aleksander put their collective Powerpoint skills to good use and made a presentation on the stuff they’d done and learned during their time working on this top secret project, but rather than call their presentation “Stuff We Learned Doing a Project” (that’s what I would have named it) they decided to call it “Experience Report: Continuous Delivery in a Corporate Windows World”, which is probably why nobody ever asks me to come up with names for their presentations.
This talk was at Skills Matter in London, and on my way over there I compiled a list of questions which I was keen to hear their answers to. The questions were as follows:
- What tools did they use for infrastructure deployments? What VM technology did they use?
- How did they do db deployments?
- What development tools did they use? TFS?? And what were their good/bad points?
- Did they use a front-end tool to manage deployments (i.e. did they manage them via a C.I. system)?
- Was the company already bought-in to Continuous Delivery before they started?
- What breed of agile did they follow? Scrum, Kanban, TDD etc.
- What format were the built artifacts? Did they produce .msi installers?
- What C.I. system did they use (and why did they decide on that one)?
- Did they use a repository tool like Nexus/Artifactory?
- If they could do it all again, what would they do differently?
During the evening (mainly in the pub afterwards) Pat and Aleksander answered almost all of the questions above, but before I list them, I’ll give a brief summary of their talk. Disclaimer: I’m probably paraphrasing where I’m quoting them, and most of the content is actually my opinion, sorry about that. And apologies also for the completely unrelated snowboarding pictures.
Cultural Aspects of Continuous Delivery
Although CD is commonly associated with tools and processes, Pat and Aleksander were very keen to push the cultural aspects as well. I couldn’t agree more with this – for me, Continuous Delivery is more than just a development practice, it’s something which fundamentally changes the way we deliver software. We need to have an extremely efficient and reliable automated deployment system, a very high degree of automated testing, small consumable sized stories to work from, very reliable environment management, and a simplified release management process which doesn’t create problems than it solves. Getting these things right is essential to doing Continuous Delivery successfully. As you can imagine, implementing these things can be a significant departure from the traditional software delivery systems (which tend to rely heavily on manual deployments and testing, as well as having quite restrictive project and release management processes). This is where the cultural change comes into effect. Developers, testers, release managers, BAs, PMs and Ops engineers all need to embrace Continuous Delivery and significantly change the traditional software delivery model.
When Aleksander and Pat started out on this project, the dev teams were already using TFS as a build system and for source control. They eventually moved to using TeamCity as a Continuous Integration system, and Git-TFS as the devs primary interface to the source control system.
- Builds were done using msbuild
- They used TeamCity to store the build artifacts
- They opted for Nunit for unit testing
- Their build created zip files rather than .msi installers
- They chose Specflow for stories/spec/acceptance criteria etc
- Used Powershell to do deployments
- Sites were hosted on IIS
“Work in Progress” and “Flow” were the important metrics in this project by the sounds of things (suffice to say they used Kanban). I neglected to ask them if they actually measured their flow against quality. If I find out I’ll make a note here. Anyway, back to the project… because of the importance of Work in Progress and Flow they chose to use Kanban (as mentioned earlier) – but they still used iterations and weekly showcases. These were more for show than anything else: they continued to work off a backlog and any tasks that were unfinished at the end of one “iteration” just rolled over to the next.
Another point that Pat and Aleksander stressed was the importance of having good Business Analysts. They were essential in carving up stories into manageable chunks, avoiding “analysis paralysis”, shielding the devs from “fluctuating functionality” and ensuring stories never got stuck for too long. Some other random notes on their processes/practices:
- Used TDD with pairing
- Testers were embedded in the team
- Maintained a single branch of code
- Regression testing was automated
- They still had to raise a release request with Ops to get stuff deployed!
- The same artifact was deployed to every environment*
- The same deploy script was used on every environment*
* I mention these two points because they’re oh-so important principles of Continuous Delivery.
Obviously I approve of the whole TDD thing, testers embedded in the team, automated regression testing and so on, but not so impressed with the idea of having to raise a release request (manually) with Ops whenever they want to get stuff deployed, it’s not very “devops” :-) I’d seek to automate that request/approval process. As for the “single branch of code”, well, it’s nice work if you can get it. I’m sure we’d all like to have a single branch to work from but in my experience it’s rarely possible. And please don’t say “feature toggling” at me.
One area the guys struggled with was performance testing. Firstly, it kept getting de-prioritised, so by the time they got round to doing it, it was a little late in the day, and I assume this meant that any design re-considerations they might have hoped to make could have been ruled out. Secondly, they had trouble actually setting up the load testing in Visual Studio – settings hidden all over the place etc.
Speaking with Pat, he was clearly very impressed with the wonders of Powershell scripting! He said they used it very extensively for installing components on top of the OS. I’ve just started using it myself (I’m working with Windows servers again) and I’m very glad it exists! However, Aleksander and Pat did reveal that the procedure for deploying a new test environment wasn’t fully automated, and they had to actually work off a checklist of things to do. Unfortunately, the reality was that every machine in every environment required some degree of manual intervention. I wish I had a bit more detail about this, I’d like to understand what the actual blockers were (before I run into them myself), and I hate to think that Windows can be a real blocker for environment automation.
Anyway, that’s enough of the detail, let’s get to the answers to the questions (I’ve added scores to the answers because I’m silly like that):
- What tools did they use for infrastructure deployments? What VM technology did they use? – Powershell! They didn’t provision the actual VMs themselves, the Ops team did that. They weren’t sure what tools they used. 1
- How did they do db deployments? – Pass 0
- What development tools did they use? TFS?? And what were their good/bad points? – TFS. Source Control and C.I. were bad so they moved to TeamCity and Git-TFS 2
- Did they use a C.I. tool to manage deployments? – Nope 0
- Was the company already bought-in to Continuous Delivery before they started? – They hired ThoughtWorks so I guess they must have been at least partly sold on the idea! Agile adoption was on their roadmap 1
- What breed of agile did they follow? Scrum, Kanban, TDD etc. – TDD with Kanban 2
- What format were the built artifacts? Did they produce .msi installers? – Negatory, they used zip files like any normal person would. 2
- What C.I. system did they use (and why did they decide on that one)? – TeamCity, which is interesting seeing as how ThoughtWorks Studios produce their own C.I. system called “Go”. I’ve used Go before and it’s pretty good conceptually, but it’s also expensive and hard to manage once you’re running over 50 builds and 100 test agents. The UI is buggy too. However, it has great features, but the Open Source competitors are catching up fast. 2
- Did they use a repository tool like Nexus/Artifactory? – They used TeamCity’s own internal repo, a bit like with Jenkins where you can store a build artifact. 1
- If they could do it all again, what would they do differently? – They wouldn’t push so hard for the Git TFS integration, it was probably not worth the considerable effort at the end of the day. 1
What does this total mean? Absolutely nothing at all.
What significance do all the snowboard pictures have in this article? None. Absolutely none whatsoever.
Continuous Integration is now very much a central process of most agile development efforts, but it hasn’t been around all that long. It may be widely regarded as a “development best practice” but some teams are still waiting to adopt C.I. Seriously, they are.
And it’s not just agile teams that can benefit from C.I. The principles behind good C.I. can apply to any development effort.
This article aims to explain where C.I. came from, why it has become so popular, and why you should adopt it on your development project, whether you’re agile or not.
Back in the Day…
Are you sitting comfortably? I want you to close your eyes, relax, and cast your mind back, waaay back, to 2003 or something like that…
You’re in an office somewhere, people are talking about The Matrix way too much, and there’s an alarming amount of corduroy on show… and developers are checking in code to their source control system….
Suddenly a developer swears violently as he checks out the latest code and finds it doesn’t compile. Someone’s check-in has broken the codebase.
He sets about fixing it and checking it back in.
Suddenly another developer swears violently….
Rinse and repeat.
CI started out as a way of minimising code integration headaches. The idea was, “if it’s painful, don’t put it off, do it more often”. It’s much better to do small and frequent code integrations rather than big ugly ones once in a while. Soon tools were invented to help us do these integrations more easily, and to check that our integrations weren’t breaking anything.
Excavations of fossilized C.I. systems from the early 21st Century suggest that these primitive C.I. systems basically just compiled code, and then, when unit tests became more popular, they started running unit tests as well. So every time someone checked in some code, the build would make sure that this integration would still result in a build which would compile, and pass the unit tests. Simple!
C.I. systems then started displaying test results and we started using them to run huuuuge overnight builds which would actually deploy our builds and run integration tests. The C.I. system was the automation centre, it ran all these tasks on a timer, and then provided the feedback – this was usually an email saying what had passed and broken. I think this was an important time in the evolution of C.I. because people started seeing C.I. as more of an information generator, and a communicator, rather than just a techie tool that ran some builds on a regular basis.
Management teams started to get information out of C.I. and so it became an “Enterprise Tool”.
Some processes and “best practices” were identified early on:
- Builds should never be left in a broken state.
- You should never check in on a broken build because it makes troubleshooting and fixing even harder.
With this new-found management buy-in, C.I. became a central tenet of modern development practices.
People started having fun with C.I. plugging lava lamps, traffic lights and talking rabbits into the system. These were fun, but they did something very important in the evolution of C.I. – they turned it into an information radiator and a focal point of development efforts.
Automation was the big selling point for C.I. Tasks that would previously have been manual, error-prone and time-consuming could now be done automatically, or at night while we were in bed. For me it meant I didn’t have to come in to work on the weekends and do the builds! Whole suites of acceptance, integration and performance tests could automatically be executed on any given build, on a convenient schedule thanks to our C.I. system. This aspect, as much as any other, helped in the widespread adoption of C.I. because people could put a cost-saving value on it. C.I. could save companies money. I, on the other hand, lost out on my weekend overtime.
Static analysis and code coverage tools appeared all over the place, and were ideally suited to be plugged in to C.I. These days, most code coverage tools are designed specifically to be run via C.I. rather than manually. These tools provided a wealth of feedback to the developers and to the project team as a whole. Suddenly we were able to use our C.I. system to get a real feeling for our project’s quality. The unit test results combined with the static analysis could give us information about the code quality, the integration and functional test results gave us verification of our design and ensured we were making the right stuff, and the nightly performance tests told us that what we were making was good enough for the real world. All of this information got presented to us, automatically, via our new best friend the Continuous Integration system.
Linking C.I. With Stories
When our C.I. system runs our acceptance tests, we’re actually testing to make sure that what we’ve intended to do, has in fact been done. I like the saying that our acceptance tests validate that we built the right thing, while our unit and functional tests verify that we built the thing right.
Linking the ATs to the stories is very important, because then we can start seeing, via the C.I. system, how many of the stories have been completed and pass their acceptance criteria. At this point, the C.I. system becomes a barometer of how complete our projects are.
So, it’s time for a brief recap of what our C.I. system is providing for us at this point:
1. It helps us identify our integration problems at the earliest opportunity
2. It runs our unit tests automatically, saving us time and verifying or code.
3. It runs static analysis, giving us a feel for the code quality and potential hotspots, so it’s an early warning system!
4. It’s an information radiator – it gives us all this information automatically
5. It runs our ATs, ensuring we’re building the right thing and it becomes a barometer of how complete our project is.
And we’re not done yet! We haven’t even started talking about deployments.
Ok now we’ve started talking about deployments.
C.I. systems have long been used to deploy builds and execute tests. More recently, with the introduction of advanced C.I. tools such as Jenkins (Hudson), Bamboo and TeamCity, we can use the C.I. tool not only to deploy our builds but to manage deployments to multiple environments, including production. It’s now not uncommon to see a Jenkins build pipeline deploying products to all environments. Driving your production deployments via C.I. is the next logical step in the process, which we’re now calling “Continuous Delivery” (or Continuous Deployment if you’re actually deploying every single build which passes all the test stages etc).
Below is a diagram of the stages in a Continuous Delivery system I worked on recently. The build is automatically promoted to the next stage whenever it successfully completes the current stage, right up until the point where it’s available for deployment to production. As you can imagine, this process relies heavily on automation. The tests must be automated, the deployments automated, even the release email and it’s contents are automated.So what exactly is the cost saving with having a C.I. system?
Yeah, that’s a good question, well done me. Not sure I can give you a straight answer to that one though. Obviously one of the biggest factors is the time savings. As I mentioned earlier, back when I was a human C.I. machine I had to work weekends to sort out build issues and get working code ready for Monday morning. Also, C.I. sort of forces you to automate everything else, like the tests and the deployments, as well as the code analysis and all that good stuff. Again we’re talking about massive time savings.
But automating the hell out of everything doesn’t just save us time, it also eliminates human error. Consider the scope for human error in a system where some poor overworked person has to manually build every project, some other poor sap has to manually do all the testing and then someone else has to manually deploy this project to production and confidently say “Right, now that’s done, I’m sure it’ll work perfectly”. Of course, that never happened, because we were all making mistakes along the line, and they invariably came to light when the code was already live. How much time and money did we waste fixing live issues that we’d introduced by just not having the right processes and systems in place. And by systems, of course, I’m talking about Continuous Integration. I can’t put a value on it but I can tell you we wasted LOTS of money. We even had bugfix teams dedicated to fixing issues we’d introduced and not caught earlier (due in part to a lack of C.I.).
While for many companies C.I. is old news, there are still plenty of people yet to get on board. It can be hard for people to see how C.I. can really make that much of a difference, so hopefully this blog will help to highlight some of the benefits and explain how C.I. has been adopted as one of the most important and central tenets of modern software delivery.
For me, and for many others, Continuous Integration is a MUST.