Continuous Deployment with Thick Clients

(Update: Part two is up now. It details the technical changes that are required to make this work.)

Earlier this week Eric Ries and Timothy Fitz posted on a technique called Continuous Deployment that they use at IMVU.  Timothy describes it like this:

The high level of our process is dead simple: Continuously integrate (commit early and often). On commit automatically run all tests. If the tests pass deploy to the cluster. If the deploy succeeds, repeat.

Those posts seem to be mostly about deploying new code to web servers. I thought it would be interesting to consider what it would take to follow a similar process for an MMO with a big installed client. I will do that in two parts with today’s part being the deployment process itself.  In a future post I will describe the architectural changes you would have to make to not have frequent updates annoy your players enough that they riot and burn down your datacenter (or go play someone else’s game instead.)

As a point of comparison, here’s how patches are typically developed and deployed for Pirates of the Burning Sea:

  1. The team makes the changes. That takes three weeks for monthly milestone builds with a lot of the integration happening in the last week. For a hotfix it usually takes a few hours to get to the root cause of the problem and then half an hour to make the change, build, and test on the developer’s machine.
  2. A build is run on the affected branch. A full build takes an hour and twenty minutes; incremental builds can be ten minutes if no major headers changed. The form of the build that the dev team uses is copied to a shared server at this point.
  3.  Some quick automated tests are run to fail seriously broken builds as early as possible. These take 10 minutes.
  4. The build is packed up for deployment. This takes 80 minutes. The build is now ready for manual testing (since the testers use the packed retail files to be as close to a customer install as possible.)
  5. Slow automated tests run. These take 30 minutes.
  6. (concurrent with 5 if necessary) The testers run a suite of manual tests called BVTs. These take about half an hour. These tests run on each nightly build even if it isn’t going to be deployed to keep people from losing much time to bad builds. Most people wait for the BVTs to finish before upgrading to the previous night’s build.
  7. For milestone builds the testers spend at least two weeks testing all the new stuff. Builds are sent to SOE for localization of new content at the same time.
  8. A release test pass is run with a larger set of tests. These take about three hours.
  9. The build is uploaded to SOE for client patching and to the datacenter for server upgrades. This takes a few minutes for a hotfix or 6-8 hours for a month’s worth of changes.
  10. At this point the build is pushed to the test server. Milestone builds sit there for a couple weeks. Hot fixes might stay for a few hours or even go straight to live in an emergency.
  11. SOE pushes the patch to their patch servers without activating the patch for download. That takes between one and eight hours depending on the size of the patch.
  12. At the pre-determined downtime, the servers are taken offline and SOE is told to activate the patch. This usually happens at 1am PST and doesn’t take any time. Customers can begin downloading the patch at this point.
  13. Flying Lab operations people upgrade all the servers in parallel. This takes up to three hours for milestone builds, but more like an hour for hotfixes. They bring the servers up locked so that they can be tested before people come in.
  14. The customer service team tests each server to make sure it’s pretty much working and then operations unlocks the servers.  That takes about 20 minutes.
  15. If the build was being deployed to the test server, after some days of testing steps 11-14 will happen again for the live servers.
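
The automatable stretch of that list (roughly steps 2 through 6) is essentially a chain of gates, where the deploy stops at the first failure. A minimal sketch in Python; the stage names and commands are hypothetical stand-ins, not Flying Lab's actual tooling:

```python
# Hypothetical stage commands; a real pipeline would substitute its own tools.
STAGES = [
    ("incremental build",      ["make", "incremental"]),
    ("quick automated tests",  ["run_tests", "--quick"]),
    ("pack build",             ["pack_build", "--incremental"]),
    ("slow automated tests",   ["run_tests", "--slow"]),
    ("upload for patching",    ["upload_build"]),
]

def run_pipeline(stages, run):
    """Run each stage in order; stop at the first failure.

    `run` is any callable that takes a command list and returns an exit
    code, so a real deploy can pass a subprocess wrapper while a test can
    pass a stub. Returns the name of the failed stage, or None on success.
    """
    for name, cmd in stages:
        if run(cmd) != 0:
            return name  # fail the whole deploy as early as possible
    return None
```

The point of the quick-test stage early in the list is exactly this early-exit behavior: a seriously broken build never reaches the expensive stages.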

This may seem like a long, slow process but it actually works pretty well.  Pirates has had 11 “monthly” patches since it launched, and they have gone well.  Not many other MMOs are pushing content with that frequency.  Some of the slow bits (like the upload to SOE and their push to their patchers) are the result of organizational handoffs that would take quite a bit of work to change. Flying Lab also spends a fair amount of time testing manually over and above the automated tests. Those manual tests have kept thousands of bugs from going live and as far as I’m concerned they are an excellent use of resources. I am not trying to bash Flying Lab in any way, just to provide a baseline for what an MMO deployment process looks like. I suspect many other games have a similar sequence of events going into each patch, and would love to hear confirmation or rebuttal from people who have experience on other games.

Assuming minimum time for a patch, these are the times we are left with (in hours and minutes):

Pirates Deployment Time

0:10 Run incremental build
0:10 Run quick automated tests
1:20 Pack the build
0:30 Run quick manual tests/slow automated tests
3:00 Run release pass smoke tests
0:05 Upload build for patching
1:00 Build is pushed to patchers
1:00 Servers are upgraded
0:20 A final quick test is made on each server
7:35 Total time to deploy
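
That 7:35 total is just the sum of the minimum stage times above; a quick check of the arithmetic:

```python
# Minimum stage times from the table, as (hours, minutes).
stages = {
    "incremental build":        (0, 10),
    "quick automated tests":    (0, 10),
    "pack the build":           (1, 20),
    "manual/slow automated":    (0, 30),
    "release pass smoke tests": (3, 0),
    "upload build":             (0, 5),
    "push to patchers":         (1, 0),
    "server upgrade":           (1, 0),
    "final per-server check":   (0, 20),
}

total = sum(h * 60 + m for h, m in stages.values())  # minutes
print(f"{total // 60}:{total % 60:02d}")  # → 7:35
```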

If it takes seven and a half hours to deploy a new build, you obviously aren’t going to get more than one of them out in an 8 hour work day. Let’s forget for a moment that IMVU is able to do this in 15 minutes and pretend that our target is an hour. For now let’s assume that we will spend 10 minutes on building, 10 minutes on automated testing, 30 minutes on manual testing, and the last 10 minutes actually deploying the build. We will also assume that this is the time to deploy to the test server, and that it is slower than the actual push from test to live.

Without going into detail I am going to assume three architectural changes to make these targets much more likely. First, the way the game uses packed asset files will need to change to allow trivial incremental updates to every file in the game without any 80 minute pack process; I will assume pack time becomes a minor part of deployment. Second, unlike Pirates, our theoretical continuously deployed MMO will not require every object in the database to be touched when a new build is deployed, so the hour spent upgrading databases on the servers is reduced to about five minutes of file copying. Third, almost all of the patching process moves into the client itself and out of the launcher, which eliminates both the transfer of bits to Southern California and the push of pre-built patches out to the patcher clusters around the world. I will go into each of these assumptions in my next post, but let’s just pretend for now that they appear magically and reduce our time accordingly.
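
The first and third assumptions both amount to patching individual files instead of monolithic packs. As a sketch of what that could look like, assuming the build produces loose files and both client and server keep a manifest of content hashes (all names here are illustrative, not any shipping patcher's format):

```python
import hashlib
from pathlib import Path

def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to the build root) to a content hash."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*") if p.is_file()
    }

def changed_files(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Files a client or server needs to fetch: anything new or modified."""
    return sorted(path for path, digest in new.items()
                  if old.get(path) != digest)
```

With something like this in place, "packing" a hotfix is just publishing the handful of files whose hashes changed, which is why the 80 minute pack step can mostly disappear.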

The amount of time spent on automated testing in this deployment process is 40 minutes. Fortunately this work is relatively easy to spread out over multiple machines to reduce the total time involved. Assuming a test farm of at least five machines we should be able to accomplish our ten minute goal.
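
Splitting a suite across a test farm is mostly a scheduling problem. Here is one sketch, assuming we know a rough duration for each test; the greedy longest-first heuristic below is a standard load-balancing approach, not anything Pirates actually used:

```python
import heapq

def partition_tests(tests, machines):
    """Assign (name, minutes) tests to machines, longest tests first,
    always giving the next test to the least-loaded machine.
    Returns the per-machine test lists and the resulting wall-clock time."""
    heap = [(0.0, i) for i in range(machines)]   # (total minutes, machine)
    buckets = [[] for _ in range(machines)]
    for name, minutes in sorted(tests, key=lambda t: -t[1]):
        total, i = heapq.heappop(heap)           # least-loaded machine
        buckets[i].append(name)
        heapq.heappush(heap, (total + minutes, i))
    return buckets, max(t for t, _ in heap)
```

As long as no single test runs longer than ten minutes, 40 minutes of tests spread over five machines lands comfortably under the ten minute budget.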

It is possible that we could have multiple testers working in parallel to perform the manual tests more quickly. There are 230 minutes of manual testing in the original sequence, so doing that amount of work in 20 minutes would take 11.5 manual testers, assuming perfect division of labor. There are two things wrong with that statement. The first is that perfect division of labor among 12 people is impossible, so the number is probably closer to 20.  The second problem is that having a team of 12 on staff to do nothing but test new builds with tiny incremental changes in them is not worth nearly what it costs you in salaries. A much more likely outcome is that the amount of manual testing in the build process is drastically reduced. In addition, we can get our playerbase involved in the deployment testing if we swap things around a bit.

Trimmed Deployment Time

0:10 Run incremental build
0:10 Run automated tests on five machines
0:05 Upload only changed files to game servers and patch servers
0:05 Deploy build on test server
0:30 Manual smoke test on public test server
Players can test at the same time.
1:00 Total time to deploy

In fact, if those 30 minutes of manual testing are your bottleneck and you can keep the pipeline full, you can push a fresh build every 30 minutes, or 16 times a day. Forget entirely about pushing to live for a moment and consider what kind of impact that would have on your test server. Your team could focus on fixing issues that players on the test server find while those players are still online. Assuming a small change that you can make in half an hour, it would take only an hour from the start of the work on that fix to when it is visible to players. That pace is fast enough that it would be possible to run experiments with tuning values, prices of items, or even algorithms.

Of course, for any of this to work the entire organization needs to be arranged around responding to player feedback multiple times per day. The real advantage of a rapid deployment system is to make your change -> test -> respond loop faster. If your plans are set a month out and your developers don’t have time to work on anything other than their assigned tasks for that entire time, there is not much point in pushing builds this quickly.

What do you think? Obviously actually implementing this is more work than moving numbers around in a table. :)


8 Responses to “Continuous Deployment with Thick Clients”

  1. Ben Zeigler thought on :

    Our build/deploy is a fair amount like POTBS, with the following changes:

    1) We run the patchservers, and our patchservers handle incremental data uploads, so the build to patcher push is much faster for most builds. It also happens at build time automatically, so when we push an incremental build most of the files are already up on the patch servers.
    2) Our full builds max out at 30 minutes or so. Data-packing can take a good bit longer, but that’s incremental so doesn’t happen most of the time.
    3) Server upgrading uses an identical process to client patching (grabbing a different subset of files off the patchserver). Also, this means end users can pre-patch at the same time the servers are getting ready in theory.

    I’ve run a complete deploy process of a prototype project (ie, much less data than PotBS but just as much code, so the patching is faster than it would be for a real game) in about 40 minutes from a completely new build to externally accessible servers.

  2. Joe said on :

    Did you run the patch servers in the CoH days, or did they run down at NC Austin and cause a similar two-stage process?

  3. Ben Zeigler replied on :

    I believe they hosted them, but the patchservers were our own code so we had control over the process. I know servers patched the same as clients, but the details of that predate me.

    There was much politicking required to let us use our own written patchservers, as opposed to the “approved” ones.

  4. Kevin Gadd commented on :

    It’s great to hear that POTBS is able to come close to the continuous deployment ideal for an MMO. I’m also surprised to hear that Cryptic’s build process is so streamlined – I never would have guessed from my experience playing CoH/CoV, but maybe that’s just because I started playing after the game was fairly mature, so the major changes came in the Publishes.

    Continuous deployment has a fundamental impact on how you think about game deployment and patching, compared to the old-fashioned deployment approach where you focus on building gold masters, but it totally pays off.

    My experience with semi-continuous deployment on Guild Wars definitely left me hungering for more, and the horror stories about weekly builds that I heard from some of my more experienced coworkers made me wonder why more companies hadn’t tried to get their build times down – it’s really great to hear that other people are trying as hard as the ANet crew were.

    There were cases where being able to deploy more than once a day meant that we could tackle an emergency in the span of a day or weekend instead of having to crunch away for weeks, by simply putting together a potential solution, rolling it out, and seeing if the problem went away, instead of having to spend days or weeks waiting on builds and exhaustive QA.

    On the other hand, there’s still a lot more ground to be gained, so I hope people like you keep working hard on improvements in this area – it’s amazing how much better the quality of life can be for hardworking artists, designers and engineers when the consequences of a failed build or bad deployment are measured in hours of stress instead of days or weeks. In an industry that’s somewhat famous for bad work conditions and endless crunch time, I think that’s a really big deal.

  5. Joe wrote on :

    Is 7.5 hours really close to the ideal? :)

    I would love to hear more about how this process worked on Guild Wars. It seems like the ability to download content from within the client and to operate with a subset of the data would be a big help and GW could do both of those things.

  6. Kevin Gadd commented on :

    Well, my ideal is closer to 30 minutes, but from the perspective of a gamer, a 7 hour deploy is pretty damn good if it includes things like publisher approval and QA turnaround.

    The speed of light starts becoming a serious barrier when you have as much content to deploy as a typical MMO; IMVU gets off a little easy since our deployments are almost always in the 5-50MB range, so we don’t run into bandwidth issues when getting stuff onto the servers.

    I’ll try and share what I remember about GW’s deployment process: (Disclaimer: I spent most of my time working on the content pipeline and designing game content, so I’m not a domain expert here. Some of this is probably completely wrong.)

    Internal ‘development’ builds took between 15 and 60 minutes, depending on whether we were building large content changes or just code changes. After we had a finished dev build, we could test it on local machines and on a small development cluster that our QA and alpha test groups had access to. From development we did a merge over to our ‘staging’ (production rev) branch, and did another build in about the same amount of time. Once that was done, something resembling a one-button deploy moved the content and game code from staging to our production machines, which took under an hour – most of the time being spent actually sending the bits to the datacenter(s), so the time varied based on how much had changed.
    Once the new code was on all the servers in the datacenters, we flipped a switch that notified the live servers that any new ‘instances’ should be started using the latest version of the game code. At this point, any users running older builds of the client got told that they needed to update, and attempts to load into a new instance would be rejected until they updated.
    To summarize how the live update worked on the server, I believe we atomically loaded the new game content onto the servers, and then loaded the newest version of the game code into the server processes alongside the older version, because each build was a single loadable DLL. That let us keep old instances alive alongside new instances on the same hardware, and isolated us from some of the hairy problems you’d get from multiple client versions running in the same world.
    All game content was stored using a versioned filesystem of some sort on both the servers and the clients, so the update process was fairly efficient and incremental, with our fileservers doing the work of figuring out where revision foo of file bar happened to live and sending it off to the client.

    As for the client side, here’s what I remember:
    All content (with a few exceptions) was streamed down to the client ‘as required’, based on a set of rules about when content was needed:
    There was one set of content that was essentially “required” to run the game. If all you had was the game client, it would automatically figure out that it needed the required content and pull it down for you. After this, you got from the loader/updater screen into the login screen. Things like UI textures, fonts, etc were in here, since the client itself used these resources.
    The next set of content was essentially ‘universal’ content. This stuff wasn’t necessary to start up the client, but was basically required for loading into any instance where players could be running around – I think it contained things like the basic human skeletons, common textures, etc. This would be streamed on demand at the loading screen for a given instance once you logged in or changed zones.
    Finally, there was a set of content needed for the actual location you were in. The game code had a lot of really elaborate scaffolding that made it possible for us to statically determine what assets were needed by a given instance. From this, we built a ‘manifest’ that listed all the assets the game client could *possibly* need in that instance, and the client loaded them all before leaving the loading screen. This was particularly painful in some respects (for example, if you had a monster that spawned 5% of the time, 100% of your players would have to download the assets for it because manifests couldn’t safely be dynamic in such a manner), but it really did work well when executed correctly. The successes and failures in this particular area depended as much on careful design as they did on clever technology.
    There was also some magic for things like skill icons and character textures, where we only required the download of low-res subsets of the data, and we would stream in the high-res versions on demand as soon as they were actually loaded. I’m not certain exactly where we used it, but I know that you’d sometimes see the textures for a person’s armor ‘pop’ from low-res to high-res as we streamed them in over the network if you didn’t already have them locally. It was hard to tell exactly when this happened, though, since we also did lots of asynchronous loading from disk that tended to look like network streaming if you were on a good connection. This primarily helped for the new user case, since it meant they could download Gw.exe off our website in less than a minute and have enough content loaded to enter a town within an hour or less.

    The manifest system tied in to the build process pretty directly, and I suspect it was responsible for the short build times. If a build only took 15 minutes, that usually meant that none of the content had to be updated – and since everything was built around incremental updates, that meant we only had to deploy binaries to the servers and users’ machines, and all the build server needed to do was crunch through a bunch of source code using burly hardware. In degenerate cases where a bunch of content changed, there was a lot more time involved, but it still scaled nicely. We also had the ability to do incremental builds of source code, but we didn’t use it for deployment in fear of linker issues (we used it on local development boxes, and while it was much faster, the linker issues were real.)

    Based on my experience with GW, if I wanted to get POTBS’s build times down (as a contrived example), I’d try and kill the ‘server upgrade’ stage first by making it asynchronous in a manner like what I described above. Improving the speed of pushing out a build is valuable but not particularly easy to do, so I don’t expect there are many wins to be made there. You could cut down on the length of smoke tests, but I suspect less than 3 hours is irrational – I wouldn’t be surprised if at least that much time went into testing GW deployments and I just didn’t see it happen. Packing the build seems like it shouldn’t take an hour, but the elaborate manifest/versioned filesystem setup we had meant that there was no ‘packing’ process, so I don’t quite know what went into yours. I suspect the overhead of having to collaborate with SOE probably factors in here too; I saw very little NCSoft red tape involved in smaller deployments.

    I may regret posting this comment. (:

  7. Bryant thought on :

    I don’t think continuous deployment is a good idea, speaking as an MMO operations guy, because it makes the troubleshooting process substantially more difficult. I use the Visible IT methodology when possible, and it keys off of change management — anything that makes the changes that triggered the latest problem harder to identify is bad. The first reaction might be that continuous deployment doesn’t make that harder if it’s well tracked, but you have to assume some changes cause problems which don’t manifest immediately.

    On the other hand, I also strongly believe that the industry’s dependence on weekly or even daily downtimes is bad for us and we need to fix it. Zero downtime is a legitimate goal. Linden Labs can do updates to Second Life without taking everything down for hours; why can’t the rest of us? Sounds like Guild Wars is doing something really similar to what I’ve mentally sketched out as a process, which is neat.

    I gotta blog about both these things at some point.

1 Pingback

  1. Pingback from Audio Article #132: Continuous Deployment | IndustryBroadcast on :

    [...] A: “Continuous Deployment“ – Original [...]
