Last night I was invited to go along to the Facebook offices in London and attend a tech talk on how Facebook do release engineering and automated testing.
Now, when you go along to meetups & tech talks they often give you free pens, magazines and sometimes free beer. These freebies are bribes to make you enjoy the evening and think favorably of the content. I would never allow myself to be influenced by such things, and as such my blogs are guaranteed to be 100% impartial. Honestly. Right, that’s that done, now on with the tech-talk…
Pint of Spitfire
The first thing I did was go to the bar to collect my free beer. The choice was great, there was wine for the ladies, lager for the men, bitter for the real men, and soft drinks for, er, others. And you get your beer in a proper pint glass too. So an excellent start to the evening.
I took my seat on a very comfortable sofa and sat back, waiting for the talk to begin. Then the snacks started arriving. They were brought round by waitresses in black uniforms, so they sort of looked like ninjas. I’m not sure that was the intention though. Anyway, the snacks were delicious. I started off with a chilli and lemongrass chicken skewer. Yummy.
No sooner had I finished my chicken skewer than Girish Patangay, a Facebook release engineer, started his talk on how they do deployments to Facebook.com.
The first thing I noted was that they don’t do continuous delivery. I think I know why, and I’ll explain about that later.
Girish emphasized how important the culture is at Facebook, and explained that “ownership and impact” are very important there. This means that developers take full ownership of their changes/code and they have to have full awareness of the impact of their changes. He described the developers as “shepherds” of the code, in that they look after their changes from the moment they’re checked in, to the moment they’re pushed to production. They are also responsible for testing their changes because Facebook “don’t have a QA team” as such. It sounds like the devs are responsible for coming up with the tests and writing them. I wondered if these included Acceptance Tests, and if so, where are the acceptance criteria coming from?
Being able to shepherd your code into production is made much easier by the quick turnaround time from code commit to production push. The longest anyone would have to wait is 1 week, but mostly it’s a lot quicker than that. There is a push every day, plus 1 weekly push.
The next snack to come round was a vegetarian mini pizza, and I mean mini. I could fit the whole thing in my mouth, and it was totally delicious.
Their branching policy was pretty much the same policy as we had when I worked at uSwitch.com. They worked on main until a certain day (I think they said Sunday) when a branch was taken. From then on they work on the branch. Fixes could be deployed at any time from the previous week’s branch if they deemed them fit enough and necessary.
They also used shadow branches, which I think are the same as the latest branch plus any changes in main. The point of this is so that anyone can see the very latest merged code at any given time. I’m not sure how often this shadow branch was updated though (presumably at least daily).
By this point I’d finished my pint of beer, so a ninja came around and offered me another one! How awesome is that?! I also tucked in to another little snack, not sure what this one was but it looked like a mini bhajee and came with a dip. Tasty.
I loved the “push karma” thing they’ve got going on at Facebook. Basically everyone is born with a push karma of 4. If your changes repeatedly turn out to be a disaster or troublesome, your push karma goes down. If it goes down to 2 or below, you can’t get into the daily push and you have to wait for the weekly release. On the other hand, if your changes are notoriously smooth, then your push karma goes up, and the better your chance of getting your changes into the daily push. I really love this concept and I wish I’d thought of it at uSwitch. Back in those days we were basically doing daily pushes as well as biweekly releases, and giving people “push karma” would have been a fantastic weapon for pushing back on the odd push that I knew pretty well wasn’t going to go smoothly!
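For anyone who fancies trying push karma at home, here is a minimal sketch of the rules as I understood them. The starting value of 4 and the cut-off of 2 are from the talk; the `Developer` class, the method names and the one-point-per-push step are all my own assumptions, since Girish didn't say how much karma moves each time.

```python
from dataclasses import dataclass

STARTING_KARMA = 4      # "everyone is born with a push karma of 4"
DAILY_PUSH_CUTOFF = 2   # at 2 or below you wait for the weekly release


@dataclass
class Developer:
    """Hypothetical record of one developer's push karma."""
    name: str
    karma: int = STARTING_KARMA

    def record_push(self, went_smoothly: bool) -> None:
        # Assumption: karma moves one point per push; the talk gave no step size.
        self.karma += 1 if went_smoothly else -1

    def allowed_in_daily_push(self) -> bool:
        # Only developers above the cut-off get into the daily push.
        return self.karma > DAILY_PUSH_CUTOFF


dev = Developer("alice")
dev.record_push(went_smoothly=False)   # karma drops to 3, still eligible
dev.record_push(went_smoothly=False)   # karma drops to 2, weekly release only
print(dev.karma, dev.allowed_in_daily_push())   # prints: 2 False
```

The real system presumably involves a human judging whether a push was "troublesome"; the sketch only captures the bookkeeping.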
Pineapple and Chilli
The next treat to come my way via a ninja was a pineapple and peanut *thing* with some chilli on top. Again this was delicious. I had two of them they were so good. I could clearly identify the pineapple, and the bit of chilli on top, but I wasn’t sure what the peanut flavored thing was. I mean, presumably it was peanut, but what kind of peanut? It was more like a peanut relish than a peanut. It certainly didn’t look like a peanut. Anyway, on with the tech talk…
At Facebook, when the staff try to access facebook.com, the staff actually access latest.facebook.com – this is the latest code, deployed onto some beta servers. This way, the staff are acting like testers. What’s particularly useful about this is how easy they have made it for users to report bugs. You can even assign them to individual devs. I think it’s this “usability” which is lacking in most places. Many of us can access demo sites etc but actually capturing and reporting defects really isn’t a click-of-a-button thing, and it’s this barrier which Facebook have tried to overcome. I would love it if I could access my latest system that easily, and report a bug simply by clicking a button on the same site.
How Facebook Do Deployments
As Girish started talking about the actual technical details of how Facebook do their deployments, I tucked into a duck spring roll and my third beer. This time I was drinking Beck’s or something similar, which I swiped from a passing ninja.
About 4 years ago, Facebook did deployments using rsync, and so did I! In fact, I know a few places that still do deployments using rsync. It took about an hour for Facebook to deploy their whole site. These days they’ve got about 100 times more servers to push to, and they can do it in minutes. How??
They wouldn’t say.
Just kidding. I’ll get to that in a sec, but first they explained some approaches they considered, and why they discounted them. I should at this point mention that they deploy their entire webserver code, rather than just small parts of it in each push. This, in my opinion, is probably why they aren’t doing continuous deployment or continuous delivery. The release of the site is a 1.5GB binary. So, they looked at binary diffs, but those just aren’t that quick, and they looked at multicast, which turned out to be very complicated and a cross-datacentre configuration nightmare. They also looked at peer to peer rsync or scp, but that wasn’t working for them.
What they settled on, as Girish explained while I had another chilli and lemongrass chicken skewer (definitely my favorite), was a torrent push, and I must confess I love this idea.
It works like this: you install torrent clients on your servers, and create a torrent file. Then you simply deploy your torrent to one peer and sit back and admire your work as the peer to peer sharing gathers pace. Absolutely brilliant. I’m so annoyed I didn’t think of this as well.
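To get a feel for why peer to peer sharing "gathers pace", here is a back-of-the-envelope sketch. It is a deliberately crude model and entirely my own assumption: every server that already has the file uploads one full copy to one new server per round. Real torrent clients swap small pieces in parallel, so this is not how opentracker or any actual client behaves. The function names are made up for illustration.

```python
def rounds_with_peer_sharing(servers: int) -> int:
    """Rounds until every server has the file, when each holder uploads
    to one new server per round (so the number of holders doubles)."""
    have_file = 1   # the single peer we deployed to
    rounds = 0
    while have_file < servers:
        have_file *= 2
        rounds += 1
    return rounds


def rounds_from_single_source(servers: int) -> int:
    """Rounds when that one seeded machine uploads to every other server in turn."""
    return servers - 1


print(rounds_with_peer_sharing(10_000))    # prints: 14
print(rounds_from_single_source(10_000))   # prints: 9999
```

Even in this toy model, doubling gets 10,000 servers covered in 14 rounds instead of 9,999, which is the whole appeal.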
Their solution was based on opentracker and hrktorrent, and allowed them to push a 418MB gzip file to 10,000 servers in just 58 seconds, which is roughly equivalent to 563Gbps!!
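I wanted to check that headline figure, so here is the arithmetic. It works out to 563 if the file size is read as 418 megabytes and a gigabyte is taken as 1024 megabytes; both of those readings are my assumption about how the number was derived, and `aggregate_gbps` is just a name I made up.

```python
def aggregate_gbps(file_mb: float, servers: int, seconds: float) -> float:
    """Total data delivered across all servers, in gigabits per second."""
    total_gb = file_mb * servers / 1024   # megabytes -> gigabytes (1024 MB per GB)
    return total_gb * 8 / seconds         # bytes -> bits, then per second


print(round(aggregate_gbps(418, 10_000, 58)))   # prints: 563
```

Note this is the aggregate across the whole fleet, not the speed of any single link.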
Earlier on they said they don’t have a QA team, so when one of their testers, Damien Sereni, came up to give his talk, I got a bit confused. However, they explained that he is the WebDriver guy, and that he’s busy porting their old Watir tests over to WebDriver. I wondered why they were doing this, and obligingly they explained that it was because the Watir code was very separate from the site code and that WebDriver allowed them to keep their code together better. I’ve used Watir and WebDriver and I can understand what he means, even though it might not sound like a brilliant reason for such a switch.
Facebook use Selenium Grid & WebDriver hub to scale their tests and speed them up. This allows them to distribute their tests to multiple environments and parallelize their test execution.
This is all pretty easy when you’re testing on computers but it gets a bit tricky with mobile phones. Back in the day, when the Facebook app was separate to the site, it was a pain to deploy and a pain to test. Also you had to deal with Apple quite a lot, so you couldn’t really take control of when and how you did deployments. Nowadays the Facebook app just renders the website so things are a little different (i.e. easier). That said, automated testing for mobile, and sharing UI tests across platforms remains one of the biggest challenges at Facebook.
It would have been rude to leave without collecting my free T-shirt and Facebook-embossed pint glass, so I stuck around until the end of the talk and took the opportunity to chat with some of the Facebook engineers. One guy explained how they did roll-backs (by keeping the old code on the site and repointing a symlink) and another guy explained how they manage schema changes (by keeping the schema really really simple, and abstracting). Also, I took the opportunity to speak with one of the ninja waitresses and asked her what was in the pineapple and peanut snack. The answer: Pineapple and peanut. I had a halloumi cheese skewer (delicious) and left.
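The symlink roll-back trick is one I've seen in plenty of places, so here is a minimal sketch of the general idea rather than anything Facebook showed. The directory names, the `current` link and the `point_current_at` helper are all hypothetical, and it assumes a POSIX system where renaming over an existing symlink is atomic.

```python
import os
import tempfile


def point_current_at(release_dir: str, current_link: str) -> None:
    """Atomically repoint `current_link` at `release_dir`."""
    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)               # clear a leftover from a failed run
    os.symlink(release_dir, tmp_link)     # build the new link off to one side
    os.replace(tmp_link, current_link)    # rename over the old link in one step


# Demo in a throwaway directory with two pretend releases.
root = tempfile.mkdtemp()
old = os.path.join(root, "release-old")
new = os.path.join(root, "release-new")
os.mkdir(old)
os.mkdir(new)
current = os.path.join(root, "current")

point_current_at(new, current)   # deploy the new release
point_current_at(old, current)   # roll back: the old code never left the box
print(os.path.basename(os.readlink(current)))   # prints: release-old
```

Because the old release stays on disk, rolling back costs one rename and no copying.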
Apart from making me hungry, this is a great insight into how the big boys do deployments. Wish I was back living in the smoke so I could attend such events….down here in Auckland, NZ we don’t have such privileged access!
Have to agree with you on the “push karma”, simple but effective in encouraging a behaviour that is valued. Ownership for your actions and awareness of the impact you have and can make both contribute to why Facebook is successful.
Faster feedback is provided via testing using ‘latest.facebook.com’ – a stroke of genius. Defect resolution and fix at this stage will reduce impact. Am with you on wishing to see this in applications I work on. Too often the gap between dev and initial QA/Testing is too long.
The torrent push and associated speed is impressive. Deployment is one of the ‘dirty’ jobs in most projects. Glad that the big boys are challenging the conventions, picking up on innovative methods and sharing their success (and failures).
Were there any slides made available?
Thanks for sharing your evening out with us, making me hungry and for the insight into the way the big boys do it!
I haven’t seen any information about the slides from this particular talk being made available, but a lot of the content seems to have been covered in this earlier talk: https://www.facebook.com/video/video.php?v=10100259101684977
I see their deployment process has changed a bit now. Twitter uses torrents to push their code, and it’s interesting to know FB adopted the same approach. Have they mentioned how they invoke the new torrents on the servers??
Thanks for sharing. Very valuable information. As a matter of interest how would a junior build and release engineer go about getting invited to these types of events?
Hi Sion, that’s a good question. Most of the events I attend are things I sign up for either via LinkedIn or I get emailed about from places such as Skills Matter. On this particular occasion a friend of mine invited me. Maybe I should do a regular post on upcoming events. Would that be helpful?
Great writeup and it’s fascinating to know about the torrent deployment. I enjoyed how you interleaved some suspense by narrating the food.
It seems to me that the only thing they wouldn’t reveal is what’s in the pineapple and peanut snack. That must be their secret sauce!! The technical details are just appetizers.
more snack based blog posts please.
Hi James, nice write up. Thanks for taking the time to attend the talk, it was great to see so many folks come out. I also loved the quality of the questions. This is a pretty good summary of the talk, like the amount of details you captured. Seems like I missed out on the tasty snacks, too busy talking.
I can provide more information on the latest tier. Our latest tier is updated as often as we can. We have multiple machines kicking off builds and pushing the binaries in infinite loops and a huge distcc cluster busy compiling the c++ files.
By the way, could you correct the spelling of Damien’s first name?
@Kishore – It’s interesting, the twitter guys came up with the torrent push at the same time as we did. They just had a fairly independent system so they could open source Murder (their torrent deploy system). Our system (as I explain in the talk) was grafted over our existing system, that’s why it’s closed source. To answer your question about the torrent files getting to the hosts, we use the underlying rsync/scp method to copy over the tiny torrent files and then let the hosts download the larger files via torrent.
I heard that the testing of their latest code is done by their own developers, and that they have the latest code available on the web, hidden behind the white spaces. The latest code can be seen by selecting the screen.
Great to hear from you! How are you doing??
Yes, that’s correct, all the internal staff see the latest facebook changes when they log on to facebook from inside the office network. Pretty neat implementation of the “eat your own dogfood” principle. I still love the torrent deployment idea. Very jealous we didn’t think of that first!
An answer from an expert! Thanks for contributing.