Beer and Pizza with Facebook

https://jamesbetteley.wordpress.com/2012/04/19/beer-and-pizza-with-facebook/

Last night I was invited to go along to the Facebook offices in London and attend a tech talk on how Facebook do release engineering and automated testing.

Now, when you go along to meetups & tech talks they often give you free pens, magazines and sometimes free beer. These freebies are bribes to make you enjoy the evening and think favorably of the content. I would never allow myself to be influenced by such things, and as such my blogs are guaranteed to be 100% impartial. Honestly. Right, that’s that done, now on with the tech-talk…

Pint of Spitfire

The first thing I did was go to the bar to collect my free beer. The choice was great: there was wine for the ladies, lager for the men, bitter for the real men, and soft drinks for, er, others. And you got your beer in a proper pint glass too. So an excellent start to the evening.

I took my seat on a very comfortable sofa and sat back, waiting for the talk to begin. Then the snacks started arriving. They were brought round by waitresses in black uniforms, so they sort of looked like ninjas. I’m not sure that was the intention though. Anyway, the snacks were delicious. I started off with a chilli and lemongrass chicken skewer. Yummy.

No sooner had I finished my chicken skewer than Girish Patangay, a Facebook release engineer, started his talk on how they do deployments to Facebook.com.

The first thing I noted was that they don’t do continuous delivery. I think I know why, and I’ll explain later.

Girish emphasized how important the culture is at Facebook, and explained that “ownership and impact” are very important there. This means that developers take full ownership of their changes and have to have full awareness of the impact of those changes. He described the developers as “shepherds” of the code, in that they look after their changes from the moment they’re checked in to the moment they’re pushed to production. They are also responsible for testing their changes, because Facebook “don’t have a QA team” as such. It sounds like the devs are responsible for coming up with the tests and writing them. I wondered if these included acceptance tests, and if so, where the acceptance criteria come from.

Being able to shepherd your code into production is made much easier by the quick turnaround time from code commit to production push. The longest anyone would have to wait is a week, but mostly it’s a lot quicker than that: there’s a push every day, plus one bigger weekly push.

Branching

The next snack to come round was a vegetarian mini pizza, and I mean mini. I could fit the whole thing in my mouth, and it was totally delicious.

Their branching policy was pretty much the same as the one we had when I worked at uSwitch.com. They worked on main until a certain day (I think they said Sunday), when a branch was taken. From then on they worked on the branch. Fixes could be deployed at any time from the previous week’s branch if they were deemed necessary and fit to go.

They also used shadow branches, which I think are the latest branch plus any changes on main. The point of this is that anyone can see the very latest merged code at any given time. I’m not sure how often the shadow branch was updated though (presumably at least daily).

Push Karma

By this point I’d finished my pint of beer, so a ninja came around and offered me another one! How awesome is that?! I also tucked into another little snack; I’m not sure what this one was, but it looked like a mini bhaji and came with a dip. Tasty.

I loved the “push karma” thing they’ve got going on at Facebook. Basically everyone is born with a push karma of 4. If your changes repeatedly turn out to be disastrous or troublesome, your push karma goes down. If it drops to 2 or below, you can’t get into the daily push and you have to wait for the weekly release. On the other hand, if your changes are consistently smooth, your push karma goes up, and the better your chances of getting your changes into the daily push. I really love this concept and I wish I’d thought of it at uSwitch. Back in those days we were basically doing daily pushes as well as biweekly releases, and giving people “push karma” would have been a fantastic weapon for pushing back on the odd push that I knew pretty well wasn’t going to go smoothly!
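Just for fun, here’s roughly how I imagine the gating logic might look, sketched in a few lines of JavaScript. To be clear, this is pure guesswork on my part rather than anything Facebook showed us – the names and the one-point adjustments are mine, and only the “start at 4, blocked at 2 or below” rule comes from the talk:

var DEFAULT_KARMA = 4;        // everyone is born with a push karma of 4
var DAILY_PUSH_THRESHOLD = 2; // at 2 or below you wait for the weekly release

var karma = {};

function karmaFor(dev) {
    if (!(dev in karma)) {
        karma[dev] = DEFAULT_KARMA;
    }
    return karma[dev];
}

function recordPush(dev, wentSmoothly) {
    // smooth pushes nudge your karma up, troublesome ones drag it down
    karma[dev] = karmaFor(dev) + (wentSmoothly ? 1 : -1);
}

function canJoinDailyPush(dev) {
    return karmaFor(dev) > DAILY_PUSH_THRESHOLD;
}

// two rough pushes in a row and you're out of the daily push:
recordPush("james", false);
recordPush("james", false);
console.log(canJoinDailyPush("james")); // false - see you at the weekly release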

Pineapple and Chilli

The next treat to come my way via a ninja was a pineapple and peanut *thing* with some chilli on top. Again this was delicious; I had two of them, they were that good. I could clearly identify the pineapple and the bit of chilli on top, but I wasn’t sure what the peanut flavored thing was. I mean, presumably it was peanut, but what kind of peanut? It was more like a peanut relish than a peanut. It certainly didn’t look like a peanut. Anyway, on with the tech talk…

At Facebook, when staff try to access facebook.com, they actually get latest.facebook.com – the latest code, deployed onto some beta servers. This way, the staff are acting like testers. What’s particularly useful is how easy they have made it for users to report bugs – you can even assign them to individual devs. I think it’s this usability which is lacking in most places. Many of us can access demo sites and so on, but actually capturing and reporting defects really isn’t a click-of-a-button thing, and it’s this barrier which Facebook have tried to overcome. I would love to be able to access my latest system that easily, and report a bug simply by clicking a button on the same site.

How Facebook Do Deployments

As Girish started talking about the actual technical details of how Facebook do their deployments, I tucked into a duck spring roll and my third beer. This time I was drinking Becks or something similar, which I swiped from a passing ninja.

About 4 years ago, Facebook did deployments using rsync, and so did I! In fact, I know a few places that still do deployments using rsync. It took about an hour for Facebook to deploy their whole site. These days they’ve got about 100 times more servers to push to, and they can do it in minutes. How??

They wouldn’t say.

Just kidding. I’ll get to that in a sec; first they explained some approaches they considered, and why they discounted them. I should mention at this point that they deploy their entire webserver code in each push, rather than just the small parts that changed. This, in my opinion, is probably why they aren’t doing continuous deployment or continuous delivery – the site release is a 1.5GB binary. So: they looked at binary diffs, but those just aren’t that quick; they looked at multicast, which turned out to be very complicated and a cross-datacentre configuration nightmare; and they looked at peer-to-peer rsync and scp, but that wasn’t working for them either.

What they settled on, as Girish explained while I had another chilli and lemongrass chicken skewer (definitely my favorite), was a torrent push, and I must confess I love this idea.

It works like this: you install torrent clients on your servers and create a torrent file of your release. Then you simply seed the torrent from one peer and sit back and admire your work as the peer-to-peer sharing gathers pace. Absolutely brilliant. I’m so annoyed I didn’t think of this as well.

torrent diagram from http://torrentfreak.com
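To make the idea a bit more concrete, here’s a little sketch of the “create a torrent of your release” step, written in Node.js using the create-torrent npm package. This is emphatically not Facebook’s tooling (theirs was built on other tools, as you’ll see in a sec), and the directory name and tracker URL are just stand-ins I’ve made up:

// Build a .torrent file describing the release directory, pointing at
// a (hypothetical) internal tracker. Seed this from one server, hand the
// .torrent to the torrent clients on all the others, and let peer-to-peer
// sharing do the rest.
var createTorrent = require('create-torrent');
var fs = require('fs');

createTorrent('./release-build', {
    name: 'www-push',
    announceList: [['http://tracker.internal:6969/announce']] // made-up tracker URL
}, function (err, torrent) {
    if (err) throw err;
    fs.writeFileSync('push.torrent', torrent); // the file you distribute to every server
});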

Their solution was based on opentracker and hrktorrent, and allowed them to push a 418MB gzip file to 10,000 servers in just 58 seconds. That’s over 4TB of data distributed in under a minute, which works out at roughly 563Gbps of aggregate transfer!!

Testing

Earlier on they said they don’t have a QA team, so when one of their testers, Damien Sereni, came up to give his talk, I got a bit confused. However, they explained that he is the WebDriver guy, and that he’s busy porting their old Watir tests over to WebDriver. I wondered why they were doing this, and obligingly they explained that it was because the Watir code was very separate from the site code, and WebDriver allows them to keep their test code and site code together better. I’ve used Watir and WebDriver and I can understand what he means, even if such a switch might not sound like an obviously brilliant idea.

Facebook use Selenium Grid and a WebDriver hub to scale their tests and speed them up. This allows them to distribute their tests across multiple environments and parallelize their test execution.
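For anyone who hasn’t used a grid before, the client side is pretty simple: you point a remote WebDriver session at the hub, and the hub farms it out to a matching node. Here’s a minimal sketch using the selenium-webdriver Node.js bindings – the hub URL and the site under test are made up, and I’ve no idea what Facebook’s grid setup actually looks like:

var webdriver = require('selenium-webdriver');

// Ask the (hypothetical) grid hub for a Firefox session; the hub picks
// a node that can satisfy the request, which is what lets you run many
// of these sessions in parallel across many environments.
var driver = new webdriver.Builder()
    .usingServer('http://selenium-hub.internal:4444/wd/hub')
    .forBrowser('firefox')
    .build();

driver.get('http://latest.example.com/')
    .then(function () { return driver.getTitle(); })
    .then(function (title) {
        console.log('Page title: ' + title);
        return driver.quit(); // free the node up for the next test
    });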

This is all pretty easy when you’re testing on computers, but it gets a bit tricky with mobile phones. Back in the day, when the Facebook app was separate from the site, it was a pain to deploy and a pain to test. You also had to deal with Apple quite a lot, so you couldn’t really take control of when and how you did deployments. Nowadays the Facebook app just renders the website, so things are a little different (i.e. easier). That said, automated testing for mobile, and sharing UI tests across platforms, remains one of the biggest challenges at Facebook.

Post-Talk Drinks

It would have been rude to leave without collecting my free T-shirt and Facebook-embossed pint glass, so I stuck around until the end of the talk and took the opportunity to chat with some of the Facebook engineers. One guy explained how they do roll-backs (by keeping the old code on the servers and repointing a symlink) and another explained how they manage schema changes (by keeping the schema really, really simple, and abstracting). I also took the opportunity to speak with one of the ninja waitresses and ask her what was in the pineapple and peanut snack. The answer: pineapple and peanut. I had a halloumi cheese skewer (delicious) and left.
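The symlink trick, as I understood it, goes something like the sketch below: keep each release in its own directory, point a “current” symlink at whichever build is live, and roll back by simply repointing the link. The paths and function name here are my own invention – the only bits from the conversation are “keep the old code around” and “repoint a symlink”. Renaming a new link over the old one makes the swap atomic on POSIX systems, so the webserver never sees a half-switched state:

var fs = require('fs');

function activateRelease(releaseDir) {
    fs.symlinkSync(releaseDir, 'current.tmp'); // build the new link off to the side
    fs.renameSync('current.tmp', 'current');   // atomically repoint "current" at it
}

activateRelease('releases/build-1234'); // deploy the new code
activateRelease('releases/build-1233'); // roll back: just repoint at the old code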

Greasemonkey script for CI system

Here in Caplin Towers (it’s not really called that) we’ve got a couple of projectors displaying the Continuous Integration builds up on the walls. It’s pretty useful until you get to the point where you’ve got more projects than space on the wall. We got to that point a while ago, and have had to resort to only displaying the “most important” builds on the wall. Clearly this is not very cool, because all the builds are important.

Sorry, there's no room here!

I decided to write a script which would scroll through all of the build groups and display them on the wall. I worked out that it would take about a minute to scroll through the whole lot, with a 4 second pause on each build group. My first thought was to use Watir (a Ruby-based browser scripting tool), and this would probably have worked fine on a Jenkins, Bamboo or CruiseControl system, but not for Go (I needed my solution to work for Go, as many of our builds are in this system at present). You see, Go displays build groups by use of “views” (like Jenkins does). Unfortunately in Go there isn’t a different URL for each view, meaning I can’t just write a simple Ruby script that loads up a different page for each build group. I guess it must be handled by JavaScript.

So, I decided to try Selenium. In theory this should have worked fine, and indeed it would have if I could be bothered to spend a bit more time on it. My plan was to record a journey which loaded up each view, one after another, and then play back this journey using Selenium RC, so that I could put it into a scheduled cron job and have it run over and over again. Like I said, in theory it works fine, but in practice it wasn’t such a great idea after all. Firstly, there’s always that delay as Selenium initializes and loads the browser; then there’s the presence of the Selenium window; and then there’s the problem of having to update the script every time a new build group is added. I know most of these issues can be overcome fairly easily, especially if you’re Selenium savvy or have a Java framework for loading and running Selenium tests in place. I was just about to go down the route of writing my journey in Java (mainly so that I could manipulate the window sizes more easily), when my colleague Edmund Dipple said “I saw you struggling, so I’ve knocked this up” and showed me a Greasemonkey script which does exactly what I was looking for. 🙂

Basically the script runs through each pipeline group, one after the other, pausing for 5 seconds on each one before moving on. Perfect. He used the Chrome developer tools (or you could use Firebug on Firefox) to find the name of the pipeline group container (which turned out to be “pipeline_groups_container”) and then iterated through its child elements, each of which represents a pipeline group. The full script is here:

var timeout = 5000; // pause on each pipeline group for 5 seconds

var counter = 0;
var groups = document.getElementById("pipeline_groups_container").children;
var groupsLength = groups.length;

function scroll()
{
    // hide every pipeline group...
    for (var i = 0; i < groupsLength; i++)
    {
        groups[i].style.display = "none";
    }
    // ...then show just the current one
    groups[counter].style.display = "block";

    counter++;

    // wrap back round to the first group
    if (counter == groupsLength)
    {
        counter = 0;
    }

    setTimeout(scroll, timeout);
}

scroll();

And now we see each build group on screen, one at a time:

This is one pipeline group....

...and this is another